Speech, NLP and the Web
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
Lecture 1-4: Introduction, POS
21 July 2014



Basic information
- Slot 4: Mon 11:30, Tue 8:30, Thu 9:30 AM
- Venue: FC Kohli auditorium
- TA team: Aditya, Geetanjali, Sandeep, Sagar, Naman (adityamjoshi@gmail.com, geetanjalirakshit@gmail.com, mathiassandeep@gmail.com, sagarahire@cse.iitb.ac.in, namangupta@cse.iitb.ac.in)
- Course notes: http://www.cse.iitb.ac.in/~pb/cs626-2014
- No midsem/endsem; assignments and paper reading for new entrants, projects for others

21 July 2014Pushpak Bhattacharyya Intro

POS 2

NLP- a foundation Noisy Channel Model

Sequence W is transformed into sequence T (W passes through the noisy channel and is observed as T).

T* = argmax_T P(T|W) = argmax_T P(T) P(W|T)

W* = argmax_W P(W|T) = argmax_W P(W) P(T|W)

5 representative problems using noisy channel modeling

Statistical Spell Checking Automatic Speech Recognition Part of Speech Tagging discussed in

detail in subsequent classes Probabilistic Parsing Statistical Machine Translation

21 July 2014Pushpak Bhattacharyya Intro

POS 4

Some general observations

A* = argmax_A [P(A|B)] = argmax_A [P(A) P(B|A)]

Computing and using P(A) and P(B|A) both need
(i) looking at the internal structures of A and B
(ii) making independence assumptions
(iii) putting together a computation from smaller parts

21 July 2014Pushpak Bhattacharyya Intro

POS 5
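As an illustration, a minimal Python sketch (candidate set, prior and likelihood tables are assumptions, not from the slides) of how this argmax is used in practice: enumerate candidate A's and keep the one maximizing P(A) * P(B|A).

# Sketch of noisy-channel decoding via A* = argmax_A P(A) * P(B|A)
# 'prior' and 'likelihood' are assumed dictionaries estimated from a corpus.
def decode(b, candidates, prior, likelihood):
    best, best_score = None, 0.0
    for a in candidates:
        score = prior.get(a, 0.0) * likelihood.get((b, a), 0.0)
        if score > best_score:
            best, best_score = a, score
    return best

# e.g. spell checking: b = observed misspelling, candidates = dictionary words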

Corpus

A collection of text called corpus is used for collecting various language data

With annotation: more information, but manually labor intensive.

Practice: label automatically, then correct manually. The famous Brown Corpus contains 1 million tagged words. Switchboard, a very famous corpus: 2400 conversations, 543 speakers, many US dialects, annotated with orthography and phonetics.

21 July 2014Pushpak Bhattacharyya Intro

POS 6

What is NLP

Branch of AI. Two goals:

Science goal: understand the way language operates

Engineering goal: build systems that analyse and generate language; reduce the man-machine gap

21 July 2014Pushpak Bhattacharyya Intro

POS 7

Perspectivising NLP: areas of AI and their inter-dependencies

Search, Vision, Planning, Machine Learning, Knowledge Representation, Logic, Expert Systems, Robotics, NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 8

NLP: two pictures

(1) NLP alongside Vision and Speech.

(2) The NLP Trinity: Problems (morph analysis, part-of-speech tagging, parsing, semantics) x Algorithms (HMM, MEMM, CRF; statistics and probability + knowledge based) x Languages (Hindi, Marathi, English, French).

21 July 2014Pushpak Bhattacharyya Intro

POS 9

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increasing complexity of processing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 10

A famous sentence (1/2)

"Buffalo buffaloes Buffalo buffaloes buffalo buffalo Buffalo buffaloes Buffalo buffaloes buffalo"

21 July 2014Pushpak Bhattacharyya Intro

POS 11

A famous sentence (2/2)

"Buffalo buffaloes Buffalo buffaloes buffalo buffalo Buffalo buffaloes Buffalo buffaloes buffalo"

Buffalo: the animal / the city / to bully

21 July 2014Pushpak Bhattacharyya Intro

POS 12

NLP: multilayered, multidimensional

(The processing stack of morphology, POS tagging, chunking, parsing, semantics, discourse and coreference, in order of increasing complexity of processing, cuts across the NLP Trinity: Problems (morph analysis, POS tagging, parsing, semantics), Algorithms (HMM, MEMM, CRF) and Languages (Hindi, Marathi, English, French).)

Multilinguality: the Indian situation. Major streams: Indo-European, Dravidian, Sino-Tibetan, Austro-Asiatic.

Some languages are ranked within the top 20 in the world in terms of the populations speaking them: Hindi and Urdu 5th (~500 million), Bangla 7th (~300 million), Marathi 14th (~70 million).

NLP architecture and stages of processing: ambiguity at every stage

Phonetics and phonology, Morphology, Lexical Analysis, Syntactic Analysis, Semantic Analysis, Pragmatics, Discourse

21 July 2014Pushpak Bhattacharyya Intro

POS 15

Phonetics: processing of speech sound and associated challenges

- Homophones: bank (finance) vs. bank (river bank)
- Near homophones: maatraa vs. maatra (Hindi)
- Word boundary: आजायग (aajaayenge): aa jaayenge (will come) or aaj aayenge (will come today); "I got [ua] plate"; "His research is in human languages"
- Disfluency: ah, um, ahem, etc.

(Near-homophone trouble) The king of Abu Dhabi expired and there was national mourning for 7 days. Some children were playing in the evening when a person chided them: "Do not play, it is mourning time." The children said: "No, it is evening time and we will play."

21 July 2014Pushpak Bhattacharyya Intro

POS 16

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increasing complexity of processing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 17

Morphology: word formation rules from root words
- Nouns: plural (boy-boys); gender marking (czar-czarina)
- Verbs: tense (stretch-stretched); aspect (e.g. perfective sit-had sat); modality (e.g. request: khaanaa, khaaiie)
- The first crucial step in NLP
- Languages rich in morphology: e.g. Dravidian, Hungarian, Turkish
- Languages poor in morphology: Chinese, English
- Languages with rich morphology have the advantage of easier processing at higher stages of processing
- A task of interest to computer science: finite state machines for word morphology

21 July 2014Pushpak Bhattacharyya Intro

POS 18

Lexical Analysis

Dictionary and word properties

dog: noun (lexical property); takes '-s' in plural (morph property); animate (semantic property); 4-legged (-do-); carnivore (-do-)

21 July 2014Pushpak Bhattacharyya Intro

POS 19

Lexical disambiguation: part-of-speech disambiguation
- Dog as a noun (animal); Dog as a verb (to pursue)

Sense disambiguation
- Dog (as animal) vs. Dog (as a very detestable person)
- The chair emphasised the need for adult education

Very common in day-to-day communication:
- Satellite channel ad: "Watch what you want, when you want" (two senses of watch)
- Ground breaking ceremony / research
- (ToI 14.1.14) India eradicates polio, says WHO

21 July 2014Pushpak Bhattacharyya Intro

POS 20

Technological developments bring in new terms, and additional meanings/nuances for existing terms:
- Justify, as in "justify the right margin" (word processing context)
- Xeroxed: a new verb
- Digital Trace: a new expression
- Communifaking: pretending to talk on mobile when you are actually not
- Discomgooglation: anxiety/discomfort at not being able to access the internet
- Helicopter Parenting: over-parenting
- Obamagain, Obamacare, modinomics

21 July 2014Pushpak Bhattacharyya Intro

POS 21

Ambiguity of Multiwords: The grandfather kicked the bucket after suffering from cancer. This job is a piece of cake. Put the sweater on. He is the dark horse of the match.

Google Translations of above sentencesदादा कसर स पी ड़त होन क बाद बा ट लात मार इस काम क कक का एक टकड़ा हवटर पर रखोवह मच क अधर घोड़ा ह

21 July 2014Pushpak Bhattacharyya Intro

POS 22

Ambiguity of Named Entities

Bengali: চ ল সরকার বািড়েত আেছ
English: Government is restless at home (?) / Chanchal Sarkar is at home

Amsterdam airport: "Baby Changing Room"

Hindi: द नक दबग द नया
English: everyday bold world (actually the name of a Hindi newspaper in Indore)

High degree of overlap between NEs and MWEs

Treat differently: transliterate, do not translate

21 July 2014Pushpak Bhattacharyya Intro

POS 23

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increasing complexity of processing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 24

Structure

S -> NP VP; VP -> V NP
"I like mangoes": I = NP, like = V, mangoes = NP

21 July 2014Pushpak Bhattacharyya Intro

POS25

Structural Ambiguity
- Scope: 1. The old men and women were taken to safe locations: (old (men and women)) vs. ((old men) and women). 2. No smoking areas will allow hookas inside.
- Prepositional phrase attachment: "I saw the boy with a telescope" (who has the telescope?); "I saw the mountain with a telescope" (world knowledge: a mountain cannot be an instrument of seeing).
- Very ubiquitous; newspaper headline: "20 years later, BMC pays father 20 lakhs for causing son's death"

21 July 2014Pushpak Bhattacharyya Intro

POS 26

Garden pathing

The only minus possibly was the need to face the audience more and more insightful question answer

The old man the boat

The horse raced past the garden fell

21 July 2014Pushpak Bhattacharyya Intro

POS 27

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increasing complexity of processing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 28

Semantic Analysis
- Representation in terms of: predicate calculus, semantic nets, frames, conceptual dependencies and scripts
- "John gave a book to Mary": Give (action), Agent: John, Object: Book, Recipient: Mary
- Challenge: ambiguity in semantic role labeling
  (Eng) Visiting aunts can be a nuisance
  (Hin) aapko mujhe mithaai khilaanii padegii (ambiguous in Marathi and Bengali too, not in Dravidian languages)

21 July 2014Pushpak Bhattacharyya Intro

POS 29

Coreference challenge

Binding of co-referring nouns and pronouns:
- The monkey ate the banana because it was hungry.
- The monkey ate the banana because it was ripe and sweet.
- The monkey ate the banana because it was lunch time.

21 July 2014Pushpak Bhattacharyya Intro

POS 30

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increasing complexity of processing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 31

Pragmatics: very hard problem; model user intention.

Tourist (in a hurry, checking out of the hotel, motioning to the service boy): "Boy, go upstairs and see if my sandals are under the divan. Do not be late, I just have 15 minutes to catch the train."
Boy (running upstairs and coming back panting): "Yes sir, they are there."

World knowledge: WHY INDIA NEEDS A SECOND OCTOBER (ToI, 2.10.07)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increasing complexity of processing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

Discourse: processing of a sequence of sentences.

Mother to John: "John, go to school. It is open today. Should you bunk? Father will be very angry."

Ambiguity of open / bunk what? Why will the father be angry? Complex chain of reasoning and application of world knowledge. Ambiguity of father: father as parent, or father as headmaster.

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected – today was the math test.

He couldn't control the class.

Teacher shouldn't have made him responsible.

After all, he is just a janitor.

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (1/2)
1. Teacher (angrily): Did you miss the class yesterday? Student: Not much.
2. A man coming back to his parked car sees the sticker "Parking fine". He goes and thanks the policeman for appreciating his parking skill.
3. John: I got a Jaguar car for my unemployed youngest son. Jack: That's a great exchange!

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (2/2)
A teacher-student exchange:
Teacher: What do you think is the capital of Ethiopia?
Student: What do you think?
Teacher (angrily): I do not think <pause> I know.
Student: I do not think I know <no pause>.
21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of the Noisy Channel Model: Probabilistic Speech Recognition (Isolated Word) [8]

Problem definition: given a sequence of speech signals, identify the words.
2 steps: segmentation (word boundary detection), then identify the word.

Isolated word recognition: identify W given SS (speech signal)

W* = argmax_W P(W | SS)

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood, called the "phonological model": intuitively more tractable

P(W) = prior probability, called the "language model"

W* = argmax_W P(W | SS) = argmax_W P(W) P(SS | W)

P(W) = #(W appears in the corpus) / #(words in the corpus)

21 July 2014Pushpak Bhattacharyya Intro

POS 39
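A tiny illustration (toy corpus and names are assumptions) of the relative-frequency estimate of P(W) given above:

from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()   # toy corpus (assumption)
counts = Counter(corpus)
total = len(corpus)

def p_word(w):
    # P(W) = #(W appears in the corpus) / #(words in the corpus)
    return counts[w] / total

print(p_word("the"))   # 3/9 = 0.333...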

Ambiguities in the context of P(SS|W) or P(W|SS). Concerns:

Sound -> Text ambiguity: whether vs. weather; right vs. write; bought vs. bot

Text -> Sound ambiguity: read (present tense) vs. read (past tense); lead (verb) vs. lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

/t/: sound of t in tag; /d/: sound of d in dog; /D/: sound of th in the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) vs. Advice (noun): ad-vise, ad-vice

A syllable consists of: 1. Nucleus 2. Onset 3. Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W) = P(t o m ae t o | Word is "tomato") = product of arc probabilities

(Pronunciation automaton for "tomato": states s1-s7 plus end; arcs t -> o -> m -> {ae (0.73) | aa (0.27)} -> t -> o -> end, with all other arc probabilities 1.0.)

21 July 2014Pushpak Bhattacharyya Intro

POS 44
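For concreteness, a toy computation of P(SS|W) as the product of arc probabilities along the /t o m ae t o/ path of the automaton above (the list representation of the path is an assumption):

# Arc probabilities along the path t-o-m-ae-t-o through the "tomato" automaton.
# All arcs are 1.0 except the branch that chooses 'ae' (0.73) over 'aa' (0.27).
path_probs = [1.0, 1.0, 1.0, 0.73, 1.0, 1.0]

p_ss_given_w = 1.0
for p in path_probs:
    p_ss_given_w *= p
print(p_ss_given_w)   # 0.73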

Foundational question

Generative vs Discriminative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched?
- Entity A and Entity B: Match(A, B)
- Two entities match iff their parts match: Match(Parts(A), Parts(B))
- Two entities match iff their properties match: Match(Properties(A), Properties(B))
- Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books, Journals, Proceedings

Main texts:
- Natural Language Understanding: James Allen
- Speech and Language Processing: Jurafsky and Martin
- Foundations of Statistical NLP: Manning and Schutze
Other references:
- Statistical NLP: Charniak

Journals: Computational Linguistics, Natural Language Engineering, AI, AI Magazine, IEEE SMC

Conferences: ACL, EACL, COLING, MT Summit, EMNLP, IJCNLP, HLT, ICON, SIGIR, WWW, ICML, ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied Disciplines
- Philosophy: semantics, meaning of "meaning", logic (syllogism)
- Linguistics: study of syntax, lexicon, lexical semantics, etc.
- Probability and Statistics: corpus linguistics, testing of hypotheses, system evaluation
- Cognitive Science: computational models of language processing, language acquisition
- Psychology: behavioristic insights into language processing, psychological models
- Brain Science: language processing areas in the brain
- Physics: information theory, entropy, random fields
- Computer Sc. & Engg.: systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1: Introduction; NLP as playground for rule based and statistical techniques
- Before break: complete NLP architecture, ambiguity, start of POS tagging
- After break: NLTK (open source Python based framework of comprehensive NLP tools), POS tagging assignment

Day-2: Shallow parsing
- Before break: morph analysis and synthesis (segmentation, inflection, declension, derivation, etc.); rule based vs. statistical NLU, comparison with POS tagging as case study; Hidden Markov Model and Viterbi algorithm
- After break: POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summary
- Both linguistics and computation are needed
- Linguistics is the eye, computation the body
- The Phenomenon -> Formalization -> Technique -> Experimentation -> Evaluation -> Hypothesis Testing cycle has accorded to NLP the prestige it commands today
- Natural-science-like approach
- Neither theory building nor data-driven pattern finding can be ignored
21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

(The Trinity diagram: Problems (morph analysis, part-of-speech tagging, parsing, semantics) x Algorithms (HMM, MEMM, CRF) x Languages (Hindi, Marathi, English, French).)

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

"_" The_DT mechanisms_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, "_" said_VBD Mr._NNP Benton_NNP ._.

21 July 2014Pushpak Bhattacharyya Intro

POS 57
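Since the course plan mentions NLTK, a tagging like the one above can be reproduced in a few lines (a sketch; the tagger's output may differ slightly from the slide):

import nltk
# one-time downloads: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

sent = "The mechanisms that make traditional hardware are really being obsoleted by microprocessor-based machines."
print(nltk.pos_tag(nltk.word_tokenize(sent)))
# [('The', 'DT'), ('mechanisms', 'NNS'), ('that', 'WDT'), ('make', 'VBP'), ...]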

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increasing complexity of processing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate the complexity of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_J batsman_N Aravinda_De_Silva_N is_F a_F man_N of_F few_J words_N was_F very_R much_R evident_J on_F Wednesday_N when_F the_F legendary_J batsman_N ,_F who_F has_V always_R let_V his_N bat_N talk_V ,_F struggled_V to_F answer_V a_F barrage_N of_F questions_N at_F a_F function_N to_F promote_V the_F cricket_N league_N in_F the_F city_N ._F

N (noun), V (verb), J (adjective), R (adverb) and F (other, i.e. function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation
- That_F/N/J ('that' can be complementizer (put under 'F'), demonstrative (put under 'J') or pronoun (put under 'N'))
- former_J Sri_N/J Lanka_N/J (Sri Lanka together qualify the skipper)
- skipper_N/V ('skipper' can be a verb too)
- and_F ace_J/N ('ace' can be both J and N: "Nadal served an ace")
- batsman_N/J ('batsman' can be J as it qualifies Aravinda De Silva)
- Aravinda_N De_N Silva_N is_F a_F man_N/V ('man' can be a verb too, as in 'man the boat')
- of_F few_J words_N/V ('words' can be a verb too, as in 'he words his speeches beautifully')

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of "That"
- That man is known by the company he keeps. (Demonstrative)
- Man that is known by the company he keeps gets a good job. (Pronoun)
- That man is known by the company he keeps is a proverb. (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation (contd.)
- was_F very_R much_R evident_J on_F Wednesday_N
- when_F/N ('when' can be a relative pronoun (put under 'N') as in 'I know the time when he comes')
- the_F legendary_J batsman_N who_F/N has_V always_R let_V his_N bat_N/V talk_V/N struggle_V/N answer_V/N barrage_N/V question_N/V function_N/V promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (1/2)

Best tag sequence
= T* = argmax P(T|W)
= argmax P(T) P(W|T)   (by Bayes' Theorem)

P(T) = P(t0 = ^, t1, t2, ..., tn+1 = end-marker)
     = P(t0) P(t1|t0) P(t2|t1,t0) P(t3|t2,t1,t0) ... P(tn|tn-1,...,t0) P(tn+1|tn,...,t0)
     = P(t0) P(t1|t0) P(t2|t1) ... P(tn|tn-1) P(tn+1|tn)
     = Prod_{i=0..n+1} P(ti|ti-1)   (Bigram Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (2/2)

P(W|T) = P(w0|t0..tn+1) P(w1|w0,t0..tn+1) P(w2|w1,w0,t0..tn+1) ... P(wn|w0..wn-1,t0..tn+1) P(wn+1|w0..wn,t0..tn+1)

Assumption: a word is determined completely by its tag. This is inspired by speech recognition.

= P(w0|t0) P(w1|t1) ... P(wn+1|tn+1)
= Prod_{i=0..n+1} P(wi|ti)
= Prod_{i=1..n+1} P(wi|ti)   (Lexical Probability Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 66
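A small sketch (toy tables are assumptions, in the spirit of the ^/N/V/O example used later) that scores one candidate tag sequence with exactly this decomposition, Prod P(ti|ti-1) times Prod P(wi|ti):

# Toy transition and lexical tables (assumed numbers).
trans = {('^','N'): 0.6, ('N','N'): 0.1, ('N','V'): 0.4, ('V','$'): 0.3, ('N','$'): 0.2}
lex   = {('people','N'): 1e-3, ('people','V'): 1e-6, ('laugh','N'): 1e-5, ('laugh','V'): 1e-3}

def score(words, tags):
    # tags includes the start '^' and end '$' markers; words aligns with the inner tags
    p = 1.0
    for prev, cur in zip(tags, tags[1:]):
        p *= trans.get((prev, cur), 0.0)        # bigram transition probabilities
    for w, t in zip(words, tags[1:-1]):
        p *= lex.get((w, t), 0.0)               # lexical probabilities
    return p

print(score(['people', 'laugh'], ['^', 'N', 'V', '$']))   # 0.6*0.4*0.3 * 1e-3*1e-3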

Generative Model

^_^ People_N Jump_V High_R ._.

(Lattice: the tag sequence ^ N V R . forms a Markov chain with bigram probabilities between adjacent tags, and each word is emitted from its tag with a lexical probability; alternative candidate tags, e.g. V for "People", N for "Jump", A/N for "High", form competing states.)

This model is called the generative model. Here words are observed from tags as states. This is similar to an HMM.

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi – Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68
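A quick sketch (assumed gold/predicted tag lists) of the last two steps, per-POS accuracy and the confusion matrix:

from collections import Counter, defaultdict

gold = ['N', 'V', 'N', 'J', 'N', 'V']        # assumed gold tags
pred = ['N', 'N', 'N', 'J', 'V', 'V']        # assumed tagger output

confusion = defaultdict(Counter)             # confusion[gold_tag][predicted_tag] -> count
for g, p in zip(gold, pred):
    confusion[g][p] += 1

for tag, row in confusion.items():
    correct = row[tag]
    total = sum(row.values())
    print(tag, 'accuracy =', correct / total, 'row =', dict(row))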

(Chart: per-POS accuracy under the bigram assumption, one bar per tag of the tagset (AJ0, AJC, AT0, AV0, AVP, CJS, CRD, DT0, ITJ, NN1, NN2, NP0, PNI, PNP, PNX, PRP, TO0, VBB, VBG, VBN, VDB, VDG, VDN, VHB, VHG, VHN, VM0, VVD, VVG, VVN, VVZ, plus ambiguity-combined tags such as AJ0-NN1, AVP-PRP, CJT-DT0), with accuracy on the y-axis from 0 to 1.)

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

(Each row is a tag, each column the tag it was confused with; column order: AJ0, AJ0-AV0, AJ0-NN1, AJ0-VVD, AJ0-VVG, AJ0-VVN, AJC, AJS, AT0, AV0, AV0-AJ0, AVP.)

AJ0:     2899  20  32   1  3  3   0   0   18    35  27    1
AJ0-AV0:   31  18   2   0  0  0   0   0    0     1  15    0
AJ0-NN1:  161   0 116   0  0  0   0   0    0     0   1    0
AJ0-VVD:    7   0   0   0  0  0   0   0    0     0   0    0
AJ0-VVG:    8   0   0   0  2  0   0   0    1     0   0    0
AJ0-VVN:    8   0   0   3  0  2   0   0    1     0   0    0
AJC:        2   0   0   0  0  0  69   0    0    11   0    0
AJS:        6   0   0   0  0  0   0  38    0     2   0    0
AT0:      192   0   0   0  0  0   0   0 7000    13   0    0
AV0:      120   8   2   0  0  0  15   2   24  2444  29   11
AV0-AJ0:   10   7   0   0  0  0   0   0    0    16  33    0
AVP:       24   0   0   0  0  0   0   0    1    11   0  737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

(The NLP Trinity diagram again: the HMM sits on the Algorithms axis alongside MEMM and CRF, applied here to the POS tagging problem, across languages such as Hindi, Marathi, English and French.)

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Colored ball choosing

Urn 1: # of Red = 30, # of Green = 50, # of Blue = 20
Urn 2: # of Red = 10, # of Green = 40, # of Blue = 50
Urn 3: # of Red = 60, # of Green = 10, # of Blue = 30

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd.)

Given the transition probability table:
        U1   U2   U3
U1     0.1  0.4  0.5
U2     0.6  0.2  0.2
U3     0.3  0.4  0.3

and the emission probability table:
        R    G    B
U1     0.3  0.5  0.2
U2     0.1  0.4  0.5
U3     0.6  0.1  0.3

Observation: RRGGBRGR
State sequence: not so easily computable.

21 July 2014Pushpak Bhattacharyya Intro

POS 73

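The same urn HMM written as plain Python data structures, handy for the Viterbi and forward computations later (the start distribution pi is not specified numerically on the slide, so the uniform value is an assumption):

states = ['U1', 'U2', 'U3']
symbols = ['R', 'G', 'B']

# A[i][j] = P(next urn = j | current urn = i)   (transition probability table)
A = {'U1': {'U1': 0.1, 'U2': 0.4, 'U3': 0.5},
     'U2': {'U1': 0.6, 'U2': 0.2, 'U3': 0.2},
     'U3': {'U1': 0.3, 'U2': 0.4, 'U3': 0.3}}

# B[i][o] = P(colour o drawn | urn i)           (emission probability table)
B = {'U1': {'R': 0.3, 'G': 0.5, 'B': 0.2},
     'U2': {'R': 0.1, 'G': 0.4, 'B': 0.5},
     'U3': {'R': 0.6, 'G': 0.1, 'B': 0.3}}

pi = {'U1': 1/3, 'U2': 1/3, 'U3': 1/3}         # assumed uniform start distribution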

Diagrammatic representation (1/2)

(State diagram: the three urns U1, U2, U3 as states, with the transition probabilities above on the arcs between them, and the emission probabilities attached to each state, e.g. U1: R 0.3, G 0.5, B 0.2; U2: R 0.1, G 0.4, B 0.5; U3: R 0.6, G 0.1, B 0.3.)

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (2/2)

(The same machine with emission folded onto the arcs: each arc carries, for each colour, the product of a transition probability and an emission probability, e.g. values such as R 0.02, G 0.08, B 0.10; R 0.24, G 0.04, B 0.12; R 0.06, G 0.24, B 0.30.)

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM
1. Given the observation sequence, find the possible state sequences: Viterbi algorithm
2. Given the observation sequence, find its probability: forward/backward algorithm
3. Given the observation sequence, find the HMM parameters: Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi
- The "start" and "end" markers are important in a sequence.
- Subtrees get eliminated due to the Markov assumption.

POS tagset: N (noun), V (verb), O (other) [simplified]; ^ (start) and $ (end) [start & end states]

Lexicon: people: N, V; laugh: N, V

Corpora for training:
^ w11_t11 w12_t12 w13_t13 ... w1k_1_t1k_1 $
^ w21_t21 w22_t22 w23_t23 ... w2k_2_t2k_2 $
...
^ wn1_tn1 wn2_tn2 wn3_tn3 ... wnk_n_tnk_n $

Inference

Partial sequence graph: from ^, expand to N and V for "people", then to N and V for "laugh", ending in $.

Transition probability table (this table will change from language to language due to language divergences):
        ^     N     V     O     $
^       0    0.6   0.2   0.2    0
N       0    0.1   0.4   0.3   0.2
V       0    0.3   0.1   0.3   0.3
O       0    0.3   0.2   0.3   0.2
$       1     0     0     0     0

Lexical probability table (size of this table = #POS tags in the tagset x vocabulary size, where vocabulary size = #unique words in the corpus):
        ε        people    laugh     ...
^       1          0         0        0
N       0       1x10^-3   1x10^-5
V       0       1x10^-6   1x10^-3
O       0          0      1x10^-9
$       1          0         0        0

Inference: new sentence "^ people laugh $"

p(^ N N $ | ^ people laugh $)
= (0.6 × 1×10^-3) × (0.1 × 1×10^-5) × 0.2
(transition ^->N times P(people|N), transition N->N times P(laugh|N), transition N->$)
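The same inference spelled out as a tiny script (tables as reconstructed above; '$' stands for the end marker):

trans = {('^','N'): 0.6, ('N','N'): 0.1, ('N','$'): 0.2}      # entries needed for ^ N N $
lex   = {('N','people'): 1e-3, ('N','laugh'): 1e-5}

p = trans[('^','N')] * lex[('N','people')] \
    * trans[('N','N')] * lex[('N','laugh')] \
    * trans[('N','$')]
print(p)   # 1.2e-10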

Computational Complexity

If we have to get the probability of each sequence and then find the maximum among them, we would run into an exponential number of computations.

If |S| = #states (tags + ^ + $) and |O| = length of the sentence (#words + ^ + $), then #sequences = |S|^(|O|-2).

But a large number of partial computations can be reused using dynamic programming.

Dynamic Programming

(Tree for "^ people laugh $": from ^, the N node for "people" gets score 0.6 × 1.0 = 0.6 (times the lexical probability 10^-3 of "people" as N); extending this winning node by the transitions out of N gives
to N: 0.6 × 0.1 × 10^-3 = 6 × 10^-5
to V: 0.6 × 0.4 × 10^-3 = 2.4 × 10^-4
to O: 0.6 × 0.3 × 10^-3 = 1.8 × 10^-4
to $: 0.6 × 0.2 × 10^-3 = 1.2 × 10^-4)

There is no need to expand N4 and N5 (the nodes under the losing branches), because they will never be part of the winning sequence.

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.

Now complexity reduces from |S|^|O| to |S|·|O|.

Here we followed the Markov assumption of order 1.

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state. Keep advancing sequences that are "maximum" amongst all those ending in the same state.

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm (tree for the sentence "^ People laugh $")

(From ^ the N, V, O children have scores (0.6), (0.2), (0.2); after "people" the expansions under the N branch score 0.06×10^-3, 0.24×10^-3, 0.18×10^-3, those under the V branch score 0.06×10^-6, 0.02×10^-6, 0.06×10^-6, and the O branch scores (0), (0), (0).)

Claim: we do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass
- Will a word always be given the same tag? No. Consider the example:
  ^ people the city with soldiers $ (i.e. 'populate')
  ^ quickly people the city $
- In the first sentence "people" is most likely to be tagged as noun, whereas in the second the probability mass will shift and "people" will be tagged as verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and language phenomenon
- Long tail phenomenon: probability is very low but not zero over a large observed sequence.
- Language phenomenon: "people", which is predominantly tagged as noun, displays a long tail phenomenon; "laugh" is predominantly tagged as verb.

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

(Two N nodes, N1 with score 6×10^-5 and N2 with score 6×10^-8, each expanding on "LAUGH" to N, V, O children.)

In the next step all the probabilities will be multiplied by identical factors (lexical and transition). So the children of N2 will have probability less than the children of N1.

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean?

P(A|B) = P(B|A) if P(A) = P(B).

P(A|B) can mean causality (B causes A) or sequentiality (A follows B).

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = {U1, U2, U3}, V = {R, G, B}.
For observation O = o1 ... on and state sequence Q = q1 ... qn:
π_i = P(q1 = Ui)

A = transition probability table:
        U1   U2   U3
U1     0.1  0.4  0.5
U2     0.6  0.2  0.2
U3     0.3  0.4  0.3

B = emission probability table:
        R    G    B
U1     0.3  0.5  0.2
U2     0.1  0.4  0.5
U3     0.6  0.1  0.3

Observations and states
           O1  O2  O3  O4  O5  O6  O7  O8
OBS:        R   R   G   G   B   R   G   R
State:     S1  S2  S3  S4  S5  S6  S7  S8

Si ∈ {U1, U2, U3}: a particular state; S: state sequence; O: observation sequence; S* = "best" possible state (urn) sequence.
Goal: maximize P(S|O) by choosing the "best" S.

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O), where S is the state sequence and O is the observation sequence.

S* = argmax_S P(S|O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

           O1  O2  O3  O4  O5  O6  O7  O8
OBS:        R   R   G   G   B   R   G   R
State:     S1  S2  S3  S4  S5  S6  S7  S8

P(S1-8 | O1-8) = P(S1|O) P(S2|S1,O) P(S3|S2,S1,O) ... P(S8|S7...S1,O)

By the Markov assumption (a state depends only on the previous state):
= P(S1|O) P(S2|S1,O) P(S3|S2,O) ... P(S8|S7,O)

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) = P(A) P(B|A) / P(B)
P(A): prior; P(B|A): likelihood

argmax_S P(S|O) = argmax_S P(S) P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transition Probability

P(S) = P(S1-8) = P(S1) P(S2|S1) P(S3|S1,S2) P(S4|S1..S3) ... P(S8|S1..S7)

By the Markov assumption (k = 1):
P(S) = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) ... P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence Probability

P(O|S) = P(O1|S1-8) P(O2|O1,S1-8) P(O3|O1-2,S1-8) ... P(O8|O1-7,S1-8)

Assumption: the ball drawn depends only on the urn chosen.
P(O|S) = P(O1|S1) P(O2|S2) P(O3|S3) ... P(O8|S8)

P(S|O) ∝ P(S) P(O|S)
       = P(S1) P(S2|S1) P(S3|S2) ... P(S8|S7) · P(O1|S1) P(O2|S2) ... P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S) P(O|S) = [P(O0|S0) P(S1|S0)] [P(O1|S1) P(S2|S1)] [P(O2|S2) P(S3|S2)] [P(O3|S3) P(S4|S3)] [P(O4|S4) P(S5|S4)] [P(O5|S5) P(S6|S5)] [P(O6|S6) P(S7|S6)] [P(O7|S7) P(S8|S7)] [P(O8|S8) P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively. After S8 the next state is S9 with probability 1, i.e. P(S9|S8) = 1. O0 is the ε-transition.

           O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:        ε   R   R   G   G   B   R   G   R
State:  S0 S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

(Chain of states S0 -> S1 -> S2 -> ... -> S8 -> S9, with the arc into S1 labelled ε and the remaining arcs labelled R, R, G, G, B, R, G, R.)

           O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:        ε   R   R   G   G   B   R   G   R
State:  S0 S1  S2  S3  S4  S5  S6  S7  S8  S9

Notation: P(Ok|Sk) P(Sk+1|Sk) = P(Sk --Ok--> Sk+1), i.e. the arc from Sk to Sk+1 labelled with the output Ok.

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(Two states S1 and S2, with arcs labelled by (output, probability) pairs: (a1, 0.3), (a2, 0.4), (a1, 0.2), (a2, 0.3), (a1, 0.1), (a2, 0.2), (a1, 0.3), (a2, 0.2) on the self-loops and cross transitions.)

The question here is: "what is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree

Start: S1 = 1.0, S2 = 0.0 (after the ε input).
Reading a1: 1.0 × 0.1 = 0.1; 1.0 × 0.3 = 0.3; from S2: 0.0, 0.0.
Reading a2: 0.1 × 0.2 = 0.02; 0.1 × 0.4 = 0.04; 0.3 × 0.3 = 0.09; 0.3 × 0.2 = 0.06.

Choose the winning sequence per state per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure (contd.)

From the surviving nodes (0.09 and 0.06 after a1 a2), reading a1:
0.09 × 0.1 = 0.009; 0.09 × 0.3 = 0.027; 0.06 × 0.2 = 0.012; 0.06 × 0.3 = 0.018.
Reading a2 from the winners (0.027 and 0.012):
S1: 0.027 × 0.3 = 0.0081; S2: 0.027 × 0.2 = 0.0054; S2: 0.012 × 0.4 = 0.0048; S1: 0.012 × 0.2 = 0.0024.

The problem being addressed by this tree is S* = argmax_S P(S | a1-a2-a1-a2, μ), where a1-a2-a1-a2 is the output sequence and μ the model (or the machine).

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward): S1 S2 S1 S2 S1 on outputs a1, a2, a1, a2.

Problem statement: find the best possible sequence S* = argmax_S P(S|O, μ), where S = state sequence, O = output sequence, μ = machine or model.

μ = (A, S, T, S0), where A = alphabet set, S = state collection, T = transitions, S0 = start symbol/state.

T is defined as P(Si --ak--> Sj), the probability of the transition from Si to Sj producing output ak.

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

Ending state (rows) vs. latest symbol observed (columns):
        ε      a1                                  a2             a1               a2
S1     1.0    (1.0×0.1, 0.0×0.2) = (0.1, 0.0)    (0.02, 0.09)   (0.009, 0.012)   (0.0024, 0.0081)
S2     0.0    (1.0×0.3, 0.0×0.3) = (0.3, 0.0)    (0.04, 0.06)   (0.027, 0.018)   (0.0048, 0.0054)

Note: every cell records the winning probability ending in that state. The bold-faced value in each cell is the winning sequence probability ending in that state. Going backward from the final winner, which ends in state S2 (indicated by the 2nd element of the tuple), we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm (following James Allen, Natural Language Understanding, 2nd edition, Benjamin/Cummings, 1995)

Given:
1. The HMM, which means:
   a. Start state: S1
   b. Alphabet: A = {a1, a2, ..., ap}
   c. Set of states: S = {S1, S2, ..., Sn}
   d. Transition probability P(Si --ak--> Sj), which is equal to P(Sj, ak | Si)
2. The output string a1 a2 ... aT

To find: the most likely sequence of states C1 C2 ... CT which produces the given output sequence, i.e.
C1 C2 ... CT = argmax_C [P(C | a1 a2 ... aT, μ)]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm (contd.)

Data structures:
1. An N×T array called SEQSCORE to maintain the winner sequence always (N = #states, T = length of output sequence)
2. Another N×T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For i = 2 to N do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For t = 2 to T do
    For i = 1 to N do
        SEQSCORE(i,t) = Max over j = 1..N of [SEQSCORE(j, t-1) × P(Sj --a_k--> Si)]
        BACKPTR(i,t) = the index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = the i that maximizes SEQSCORE(i,T)
For i from (T-1) down to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1×T
2. SEQSCORE can be T×2

Homework: compare this with A* and beam search. Reason for this comparison: both of them work for finding and recovering a sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 109
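A runnable Python version of the three steps above (dictionary-based SEQSCORE/BACKPTR rather than N×T arrays; the toy tables are assumptions matching the earlier ^/N/V/O example):

def viterbi(words, states, trans, lex, start='^', end='$'):
    # seqscore[t][s]: best probability of any tag sequence ending in state s after t words
    # backptr[t][s]:  the predecessor state that achieves it
    seqscore = [{start: 1.0}]
    backptr = [{}]
    prev_states = [start]
    for w in words:
        col, bp = {}, {}
        for s in states:
            best_p, best_j = 0.0, None
            for j in prev_states:
                p = seqscore[-1][j] * trans.get((j, s), 0.0) * lex.get((w, s), 0.0)
                if p > best_p:
                    best_p, best_j = p, j
            col[s], bp[s] = best_p, best_j
        seqscore.append(col); backptr.append(bp)
        prev_states = states
    # close with the end marker, then recover the path backwards
    best_last = max(states, key=lambda s: seqscore[-1][s] * trans.get((s, end), 0.0))
    tags = [best_last]
    for t in range(len(words), 1, -1):
        tags.append(backptr[t][tags[-1]])
    return list(reversed(tags))

states = ['N', 'V', 'O']
trans = {('^','N'): 0.6, ('^','V'): 0.2, ('^','O'): 0.2,
         ('N','N'): 0.1, ('N','V'): 0.4, ('N','O'): 0.3, ('N','$'): 0.2,
         ('V','N'): 0.3, ('V','V'): 0.1, ('V','O'): 0.3, ('V','$'): 0.3,
         ('O','N'): 0.3, ('O','V'): 0.2, ('O','O'): 0.3, ('O','$'): 0.2}
lex = {('people','N'): 1e-3, ('people','V'): 1e-6,
       ('laugh','N'): 1e-5, ('laugh','V'): 1e-3}
print(viterbi(['people', 'laugh'], states, trans, lex))   # ['N', 'V']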

Viterbi Algorithm for the Urn problem (first two symbols)

(Trellis for the first two symbols, ε and R: S0 branches to U1, U2, U3 with probabilities 0.5, 0.3 and 0.2; each urn is then expanded on R, with intermediate arc values such as 0.03, 0.08, 0.15 and 0.06, 0.02, 0.02, 0.18, 0.24, 0.18, accumulating path scores 0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036 at the frontier.)

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of order > 1 (say 2)

The same theory works:
P(S) P(O|S) = P(O0|S0) P(S1|S0) [P(O1|S1) P(S2|S1,S0)] [P(O2|S2) P(S3|S2,S1)] [P(O3|S3) P(S4|S3,S2)] [P(O4|S4) P(S5|S4,S3)] [P(O5|S5) P(S6|S5,S4)] [P(O6|S6) P(S7|S6,S5)] [P(O7|S7) P(S8|S7,S6)] [P(O8|S8) P(S9|S8,S7)]

We introduce the states S0 and S9 as initial and final states respectively. After S8 the next state is S9 with probability 1, i.e. P(S9|S8,S7) = 1. O0 is the ε-transition.

           O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:        ε   R   R   G   G   B   R   G   R
State:  S0 S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence? The language modeling problem.

1. P("The sun rises in the east")
2. P("The sun rise in the east"): less probable because of a grammatical mistake
3. P("The svn rises in the east"): less probable because of a lexical mistake
4. P("The sun rises in the west"): less probable because of a semantic mistake

Probabilities are computed in the context of corpora.

21 July 2014Pushpak Bhattacharyya Intro

POS 113
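A sketch of how such judgements come out of corpus counts with a bigram language model (the toy corpus and the add-alpha smoothing are assumptions):

from collections import Counter

corpus = [["the", "sun", "rises", "in", "the", "east"],
          ["the", "sun", "sets", "in", "the", "west"]]          # toy corpus (assumption)
bigrams = Counter((s[i], s[i+1]) for s in corpus for i in range(len(s)-1))
unigrams = Counter(w for s in corpus for w in s)

def p_sentence(words, alpha=0.001):
    # bigram chain rule with crude add-alpha smoothing (assumption)
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * len(unigrams))
    return p

print(p_sentence("the sun rises in the east".split()))
print(p_sentence("the sun rise in the east".split()))   # lower: 'sun rise' unseen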

Uses of a language model
1. Detect well-formedness (lexical, syntactic, semantic, pragmatic, discourse)
2. Language identification: given a piece of text, what language does it belong to?
   Good morning: English; Guten Morgen: German; Bon jour: French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0 o1 o2 o3 ... om)?

Marginalization: P(O) = Σ_S P(O, S)

Consider the observation sequence O = O0 O1 O2 ... Om and the state sequences S = S0 S1 S2 S3 ... Sm Sm+1, where the Si's represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0 o1 o2 o3 ... om)

P(O) = Σ_S P(O, S) = Σ_S P(S) P(O|S)

P(S0..Sm+1, O0..Om)
= P(S0) P(O0|S0) P(S1|S0) P(O1|S1) P(S2|S1) ... P(Om|Sm) P(Sm+1|Sm)
= P(S0) [P(O0|S0) P(S1|S0)] [P(O1|S1) P(S2|S1)] ... [P(Om|Sm) P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(k,i) = probability of being in state Si having seen o0 o1 o2 ... ok, i.e. F(k,i) = P(o0 o1 o2 ... ok, Si).

With m as the length of the observed sequence and N states:
P(observed sequence) = P(o0 o1 o2 ... om) = Σ_{p=0..N} P(o0 o1 o2 ... om, Sp) = Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd.)

F(k, q) = P(o0 o1 o2 ... ok, Sq)
        = P(o0 o1 o2 ... ok-1, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 ... ok-1, Sp, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 ... ok-1, Sp) P(ok, Sq | o0 o1 o2 ... ok-1, Sp)
        = Σ_{p=0..N} F(k-1, p) P(ok, Sq | Sp)
        = Σ_{p=0..N} F(k-1, p) P(Sp --ok--> Sq)

           O0  O1  O2  O3  ...  Ok  Ok+1  ...  Om-1  Om
State:  S0 S1  S2  S3  ...  Sp  Sq   ...   Sm  Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119
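The forward recurrence as a short, self-contained Python function on the urn HMM (the uniform start distribution is an assumption; emission applied at the destination state is one common convention):

states = ['U1', 'U2', 'U3']
A = {'U1': {'U1': .1, 'U2': .4, 'U3': .5},
     'U2': {'U1': .6, 'U2': .2, 'U3': .2},
     'U3': {'U1': .3, 'U2': .4, 'U3': .3}}
B = {'U1': {'R': .3, 'G': .5, 'B': .2},
     'U2': {'R': .1, 'G': .4, 'B': .5},
     'U3': {'R': .6, 'G': .1, 'B': .3}}
pi = {'U1': 1/3, 'U2': 1/3, 'U3': 1/3}          # assumed uniform start

def forward(obs):
    # F(k, q) = P(o0 ... ok, state q): accumulate F over each stage of the trellis
    F = {q: pi[q] * B[q][obs[0]] for q in states}
    for o in obs[1:]:
        F = {q: sum(F[p] * A[p][q] for p in states) * B[q][o] for q in states}
    return sum(F.values())                      # P(observation sequence)

print(forward(list("RRGGBRGR")))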

Backward probability B(ki)

Define B(k,i) = probability of seeing ok ok+1 ok+2 ... om given that the state was Si, i.e. B(k,i) = P(ok ok+1 ok+2 ... om | Si).

With m as the length of the whole observed sequence:
P(observed sequence) = P(o0 o1 o2 ... om) = P(o0 o1 o2 ... om | S0) = B(0, 0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd.)

B(k, p) = P(ok ok+1 ok+2 ... om | Sp)
        = P(ok+1 ok+2 ... om, ok | Sp)
        = Σ_{q=0..N} P(ok+1 ok+2 ... om, ok, Sq | Sp)
        = Σ_{q=0..N} P(ok, Sq | Sp) P(ok+1 ok+2 ... om | ok, Sq, Sp)
        = Σ_{q=0..N} P(ok+1 ok+2 ... om | Sq) P(ok, Sq | Sp)
        = Σ_{q=0..N} B(k+1, q) P(Sp --ok--> Sq)

           O0  O1  O2  O3  ...  Ok  Ok+1  ...  Om-1  Om
State:  S0 S1  S2  S3  ...  Sp  Sq   ...   Sm  Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of forward probability: to find P(O), the probability of the observation sequence.

E.g. "^ People laugh $"

(Trellis: ^, then {N, V} for "People", then {N, V} for "laugh", then $.)

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probabilities:
        ^     N     V     $
^       0    0.7   0.3    0
N       0    0.2   0.6   0.2
V       0    0.6   0.2   0.2
$       1     0     0     0

Lexical probabilities:
        ε    People  Laugh
^       1      0       0
N       0     0.8     0.2
V       0     0.1     0.9
$       1      0       0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all state sequences S of [ Prod_j P(Oj|Sj) P(Sj -> Sj+1) ]

Computation in the various paths of the tree (ε at ^, then "People", then "Laugh"; the tree branches from ^ to N and V, and from each of these to N and V again):

Path 1: ^ N N;  P(Path1) = (1.0 × 0.7) × (0.8 × 0.2) × (0.2 × 0.2)
Path 2: ^ N V;  P(Path2) = (1.0 × 0.7) × (0.8 × 0.6) × (0.9 × 0.2)
Path 3: ^ V N;  P(Path3) = (1.0 × 0.3) × (0.1 × 0.6) × (0.2 × 0.2)
Path 4: ^ V V;  P(Path4) = (1.0 × 0.3) × (0.1 × 0.2) × (0.9 × 0.2)

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis

F = accumulated F × output probability × transition probability

F(N at "People") = 0.7 × 1.0
F(V at "People") = 0.3 × 1.0
F(N at "laugh") = F(N at "People") × (0.2 × 0.8) + F(V at "People") × (0.6 × 0.1)
F(V at "laugh") = F(N at "People") × (0.6 × 0.8) + F(V at "People") × (0.2 × 0.1)
F($) = F(N at "laugh") × (0.2 × 0.2) + F(V at "laugh") × (0.2 × 0.9)

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of Multiplications

Tree: each path has 5 multiplications + 1 addition; there are 4 paths in the tree; therefore a total of 20 multiplications and 3 additions.

Trellis: F(N at "People"): 1 multiplication; F(V at "People"): 1 multiplication; F(N at "laugh") = F × (1 mult) + F × (1 mult) = 4 multiplications + 1 addition; similarly for F(V at "laugh") and F($): 4 multiplications and 1 addition each. So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = #states and |O| = observation length (including ^ and $).
- Stage 1 of the trellis: |S| multiplications.
- Stage 2 of the trellis: |S| nodes; each node needs computation over |S| arcs; each arc = 1 multiplication; accumulated F = 1 more multiplication. Total: 2|S|^2 multiplications.
- The same holds for each stage before reading '$'. At the final stage ('$'): 2|S| multiplications.
- Therefore total multiplications = |S| + 2|S|^2 (|O| - 1) + 2|S|.

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of F values multiplied by P(Si -> Sj).
3. Complexity = |S| + 2|S|^2 (|O| - 1) + 2|S| = 2|S|^2 |O| - 2|S|^2 + 3|S| = O(|S|^2 · |O|),
   i.e. linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs machine)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List
- TnT: http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf
- Brill Tagger: http://delivery.acm.org/10.1145/1080000/1075553/p112-brill.pdf
- Hindi POS Tagger built by IIT Bombay: http://www.cse.iitb.ac.in/~pb/papers/ACL-2006-Hindi-POS-Tagging.pdf
- Projection: http://www.dipanjandas.com/files/posInduction.pdf

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Basic information Slot 4 Mon- 1130 Tue- 830 Thu- 930AM Venue FC Kohli auditorium TA team Aditya Geetanjali Sandeep Sagar Naman

adityamjoshigmailcom geetanjalirakshitgmailcom mathiassandeepgmailcom sagarahirecseiitbacin namanguptacseiitbacin

Course notes httpwwwcseiitbacin~pbcs626-2014 No midsem end sem assignments and paper reading for new

entrants projects for others

21 July 2014Pushpak Bhattacharyya Intro

POS 2

NLP- a foundation Noisy Channel Model

Sequence w is transformed into sequence t

T=argmax(P(T|W))= argmax(P(T)P(W|T))w w

W=argmax(P(W|T))= argmax(P(W)P(T|W))T T

W t

3

5 representative problems using noisy channel modeling

Statistical Spell Checking Automatic Speech Recognition Part of Speech Tagging discussed in

detail in subsequent classes Probabilistic Parsing Statistical Machine Translation

21 July 2014Pushpak Bhattacharyya Intro

POS 4

Some general observationsA= argmax [P(A|B)]

A= argmax [P(A)P(B|A)]

AComputing and using P(A) and P(B|A) both need

(i) looking at the internal structures of A and B(ii) making independence assumptions(iii) putting together a computation from smaller parts

21 July 2014Pushpak Bhattacharyya Intro

POS 5

Corpus

A collection of text called corpus is used for collecting various language data

With annotation more information but manual labor intensive

Practice label automatically correct manually The famous Brown Corpus contains 1 million tagged words Switchboard very famous corpora 2400 conversations

543 speakers many US dialects annotated with orthography and phonetics

21 July 2014Pushpak Bhattacharyya Intro

POS 6

What is NLP

Branch of AI 2 Goals

Science Goal Understand the way language operates

Engineering Goal Build systems that analyse and generate language reduce the man machine gap

21 July 2014Pushpak Bhattacharyya Intro

POS 7

Perpectivising NLP Areas of AI and their inter-dependencies

Search

Vision

PlanningMachine Learning

Knowledge RepresentationLogic

Expert SystemsRoboticsNLP

21 July 2014Pushpak Bhattacharyya Intro

POS 8

NLP Two pictures

NLP

Vision Speech

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Statistics and Probability

+Knowledge Based

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 9

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 10

A famous sentence (12)

ldquoBuffalo buffaloes Buffalo buffaloes buffalo buffalo Buffalo buffaloes Buffalo buffaloes buffalo

21 July 2014Pushpak Bhattacharyya Intro

POS 11

A famous sentence (22)

ldquoBuffalo buffaloes Buffalo buffaloes buffalo buffalo Buffalo buffaloes Buffalo buffaloes buffalo

BuffaloAnimalCitybully

21 July 2014Pushpak Bhattacharyya Intro

POS 12

NLP multilayered multidimensional

Morphology

POS tagging

Chunking

Parsing

Semantics

Discourse and Coreference

IncreasedComplexity OfProcessing

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

Multilinguality Indian situation Major streams

Indo European Dravidian Sino Tibetan Austro-Asiatic

Some languages are ranked within 20 in the world in terms of the populations speaking them Hindi and Urdu 5th (~500

milion) Bangla 7th (~300 million) Marathi 14th (~70 million)

NLP architecture and stages of processing- ambiguity at every stage

Phonetics and phonology Morphology Lexical Analysis Syntactic Analysis Semantic Analysis Pragmatics Discourse

21 July 2014Pushpak Bhattacharyya Intro

POS 15

Phonetics processing of speech sound and associated challenges

Homophones bank (finance) vs bank (river bank) Near Homophones maatraa vs maatra (hin) Word Boundary

आजायग (aajaayenge) (aa jaayenge (will come) or aaj aayenge(will come today)

I got [ua]plate His research is in human languages

Disfluency ah um ahem etc

(near homophone trouble) The king of Abu Dhabi expired and there was national mourning for 7 days Some children were playing in the evening when a person chided them Do not play it is mourning time The children said No it is evening time and we will play

21 July 2014Pushpak Bhattacharyya Intro

POS 16

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 17

Morphology Word formation rules from root words Nouns Plural (boy-boys) Gender marking (czar-czarina) Verbs Tense (stretch-stretched) Aspect (eg perfective sit-had

sat) Modality (eg request khaanaa khaaiie) First crucial first step in NLP Languages rich in morphology eg Dravidian Hungarian

Turkish Languages poor in morphology Chinese English Languages with rich morphology have the advantage of easier

processing at higher stages of processing A task of interest to computer science Finite State Machines for

Word Morphology

21 July 2014Pushpak Bhattacharyya Intro

POS 18

Lexical Analysis

Dictionary and word properties

dognoun (lexical property)take-rsquosrsquo-in-plural (morph property)animate (semantic property)4-legged (-do-)carnivore (-do)

21 July 2014Pushpak Bhattacharyya Intro

POS 19

Lexical Disambiguationpart of Speech Disambiguation

Dog as a noun (animal) Dog as a verb (to pursue)

Sense Disambiguation Dog (as animal) Dog (as a very detestable person) The chair emphasised the need for adult education

Very common in day to day communicationsSatellite Channel Ad Watch what you want when you

want (two senses of watch)Ground breaking ceremonyresearch(ToI 14114) India eradicates polio says WHO

21 July 2014Pushpak Bhattacharyya Intro

POS 20

Technological developments bring in new terms additional meaningsnuances for existing terms

Justify as in justify the right margin (word processing context)

Xeroxed a new verb Digital Trace a new expression Communifaking pretending to talk on

mobile when you are actually not Discomgooglation anxietydiscomfort at

not being able to access internet Helicopter Parenting over parenting Obamagain Obama care modinomics

21 July 2014Pushpak Bhattacharyya Intro

POS 21

Ambiguity of Multiwords The grandfather kicked the bucket after suffering from cancer This job is a piece of cake Put the sweater on He is the dark horse of the match

Google Translations of above sentencesदादा कसर स पी ड़त होन क बाद बा ट लात मार इस काम क कक का एक टकड़ा हवटर पर रखोवह मच क अधर घोड़ा ह

21 July 2014Pushpak Bhattacharyya Intro

POS 22

Ambiguity of Named Entities

Bengali চ ল সরকার বািড়েত আেছEnglish Government is restless at home ()Chanchal Sarkar is at home

Amsterdam airport ldquoBaby Changing Roomrdquo

Hindi द नक दबग द नयाEnglish everyday bold worldActually name of a Hindi newspaper in Indore

High degree of overlap between NEs and MWEs

Treat differently - transliterate do not translate

21 July 2014Pushpak Bhattacharyya Intro

POS 23

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 24

StructureS

NPVP

V NP

Ilike mangoes

21 July 2014Pushpak Bhattacharyya Intro

POS25

Structural Ambiguity Scope

1The old men and women were taken to safe locations(old men and women) vs ((old men) and women)2 No smoking areas will allow Hookas inside

Preposition Phrase Attachment I saw the boy with a telescope

(who has the telescope) I saw the mountain with a telescope

(world knowledge mountain cannot be an instrument of seeing)

Very ubiquitous newspaper headline ldquo20 years later BMC pays father 20 lakhs for causing sonrsquos deathrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 26

Garden pathing

The only minus possibly was the need to face the audience more and more insightful question answer

The old man the boat

The horse raced past the garden fell

21 July 2014Pushpak Bhattacharyya Intro

POS 27

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 28

Semantic Analysis Representation in terms of

Predicate calculusSemantic NetsFramesConceptual Dependencies and Scripts

John gave a book to Mary Give action Agent John Object Book Recipient

Mary Challenge ambiguity in semantic role labeling

(Eng) Visiting aunts can be a nuisance (Hin) aapko mujhe mithaai khilaanii padegii

(ambiguous in Marathi and Bengali too not in Dravidian languages)

21 July 2014Pushpak Bhattacharyya Intro

POS 29

Coreference challenge

Binding of co-referring nouns and pronouns The monkey ate the banana because it

was hungry The monkey ate the banana because it

was ripe and sweet The monkey ate the banana because it

was lunch time

21 July 2014Pushpak Bhattacharyya Intro

POS 30

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 31

Pragmatics Very hard problem Model user intention

Tourist (in a hurry checking out of the hotel motioning to the service boy) Boy go upstairs and see if my sandals are under the divan Do not be late I just have 15 minutes to catch the train

Boy (running upstairs and coming back panting) yes sir they are there

World knowledge WHY INDIA NEEDS A SECOND OCTOBER (ToI

21007)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

DiscourseProcessing of sequence of sentences Mother to John

John go to school It is open today Should you bunk Father will be very angry

Ambiguity of openbunk whatWhy will the father be angry

Complex chain of reasoning and application of world knowledge Ambiguity of father

father as parent or

father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected ndash today was the math test

He couldnrsquot control the class

Teacher shouldnrsquot have made him responsible

After all he is just a janitor

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (12)1 Teacher (angrily) did you miss the class yesterday

Student not much

2 A man coming back to his parked car sees the sticker Parking fine He goes and thanks the policeman for appreciating his parking skill

3 John I got a Jaguar car for my unemployed youngest sonJack Thats a great exchange

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (22)A teacher-student exchangeTeacher What do you think is the capital of Ethiopia

Student What do you think

Teacher (angrily) I do not think ltpausegt I know

Student I do not think I know ltno pausegt21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps Segmentation (Word Boundary Detection) Identify the word

Isolated Word Recognition Identify W given SS (speech signal)

^arg max ( | )

WW P W SS

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood called ldquophonological model ldquo intuitively more tractable

P(W) = prior probability called ldquolanguage modelrdquo

^arg m ax ( | )

arg m ax ( ) ( | )W

W

W P W SS

P W P SS W

W appears in the corpus( ) words in the corpus

P W

21 July 2014Pushpak Bhattacharyya Intro

POS 39

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound Text ambiguity whether vs weather right vs write bought vs bot

Text Sound ambiguity read (present tense) vs read (past tense) lead (verb) vs lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W)= P(t o m ae t o |Word is ldquotomatordquo) = Product of arc probabilities

t

s4

o m o

ae

t

aa

end

s1 s2 s3s5

s6 s7

10 10 10 1010

10

073

027

Word

Pronunciation Automaton

Tomato

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs Discrimnative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

bull Entity A and Entity BndashMatch(AB)

ndashTwo entities match iff their parts matchbull Match(Parts(A) Parts(B))

ndashTwo entities match iff their properties matchbull Match(Properties(A) Properties(B))

bull Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding James Allan Speech and NLP Jurafsky and Martin Foundations of Statistical NLP Manning and Schutze

Other References Statistical NLP Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied DisciplinesPhilosophy Semantics Meaning of ldquomeaningrdquo Logic

(syllogism)Linguistics Study of Syntax Lexicon Lexical Semantics etc

Probability and Statistics Corpus Linguistics Testing of Hypotheses System Evaluation

Cognitive Science Computational Models of Language Processing Language Acquisition

Psychology Behavioristic insights into Language Processing Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory Entropy Random Fields

Computer Sc amp Engg Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

infection declension derivation etc ) Rule based Vs Statistical NLU comparison with POS tagging as case study Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Figure: the NLP Trinity - Problems (Morph Analysis, Part of Speech Tagging, Parsing, Semantics), Algorithms (HMM, MEMM, CRF), Languages (Hindi, Marathi, English, French).

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

"_" The_DT mechanisms_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, "_" said_VBD Mr._NNP Benton_NNP ._.
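For the NLTK hands-on mentioned in the day-wise schedule, a minimal sketch of producing such Penn Treebank tags is given below. It assumes the NLTK data packages ('punkt' and 'averaged_perceptron_tagger') have already been downloaded; the exact output shown is only illustrative.

import nltk

# One-time downloads, assumed already done:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

sentence = ("The mechanisms that make traditional hardware are really "
            "being obsoleted by microprocessor-based machines.")
tokens = nltk.word_tokenize(sentence)   # split into word tokens
tagged = nltk.pos_tag(tokens)           # assign Penn Treebank tags
print(tagged)
# Illustrative output: [('The', 'DT'), ('mechanisms', 'NNS'), ('that', 'WDT'), ...]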

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increased Complexity Of Processing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_J batsman_N Aravinda_De_Silva_N is_F a_F man_N of_F few_J words_N was_F very_R much_R evident_J on_F Wednesday_N when_F the_F legendary_J batsman_N ,_F who_F has_V always_R let_V his_N bat_N talk_V ,_F struggled_V to_F answer_V a_F barrage_N of_F questions_N at_F a_F function_N to_F promote_V the_F cricket_N league_N in_F the_F city_N ._F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation:
That_F/N/J ('that' can be complementizer (can be put under 'F'), demonstrative (can be put under 'J') or pronoun (can be put under 'N'))
former_J Sri_N/J Lanka_N/J ('Sri Lanka' together qualify the skipper)
skipper_N/V ('skipper' can be a verb too)
and_F
ace_J/N ('ace' can be both J and N: "Nadal served an ace")
batsman_N/J ('batsman' can be J as it qualifies Aravinda De Silva)
Aravinda_N De_N Silva_N is_F a_F
man_N/V ('man' can be a verb too, as in 'man the boat')
of_F few_J
words_N/V ('words' can be a verb too, as in 'he words his speeches beautifully')

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of "That"

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation (contd.):
was_F very_R much_R evident_J on_F Wednesday_N
when_F/N ('when' can be a relative pronoun (put under 'N') as in 'I know the time when he comes')
the_F legendary_J batsman_N who_F/N has_V always_R let_V his_N bat_N/V talk_V/N struggle_V/N answer_V/N barrage_N/V question_N/V function_N/V promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (1/2)

Best tag sequence
= T* = argmax P(T|W)
= argmax P(T) P(W|T)   (by Bayes' Theorem)

P(T) = P(t0=^ t1 t2 … tn+1=.)
     = P(t0) P(t1|t0) P(t2|t1,t0) P(t3|t2,t1,t0) … P(tn|tn-1,tn-2,…,t0) P(tn+1|tn,tn-1,…,t0)
     = P(t0) P(t1|t0) P(t2|t1) … P(tn|tn-1) P(tn+1|tn)
     = Π (i=0 to n+1) P(ti|ti-1)        (Bigram Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (2/2)

P(W|T) = P(w0|t0-tn+1) P(w1|w0,t0-tn+1) P(w2|w1,w0,t0-tn+1) … P(wn|w0-wn-1,t0-tn+1) P(wn+1|w0-wn,t0-tn+1)

Assumption: A word is determined completely by its tag. This is inspired by speech recognition.

       = P(w0|t0) P(w1|t1) … P(wn+1|tn+1)
       = Π (i=0 to n+1) P(wi|ti)
       = Π (i=1 to n+1) P(wi|ti)        (Lexical Probability Assumption)
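A minimal sketch (not from the slides) of scoring one candidate tag sequence under exactly these two assumptions; the probability tables here are made-up toy values, with P_bigram holding P(ti|ti-1) and P_lex holding P(wi|ti).

# Score P(T) * P(W|T) for one tag sequence under the bigram and
# lexical probability assumptions. Tables are illustrative toys.
P_bigram = {('^', 'N'): 0.6, ('N', 'V'): 0.4, ('V', '.'): 0.3}              # P(t_i | t_{i-1})
P_lex    = {('people', 'N'): 1e-3, ('laugh', 'V'): 1e-3, ('.', '.'): 1.0}   # P(w_i | t_i)

def score(words, tags):
    """Return P(T) * P(W|T) with the start tag '^' prepended."""
    p, prev = 1.0, '^'
    for w, t in zip(words, tags):
        p *= P_bigram.get((prev, t), 0.0) * P_lex.get((w, t), 0.0)
        prev = t
    return p

print(score(['people', 'laugh', '.'], ['N', 'V', '.']))
# = (0.6 * 1e-3) * (0.4 * 1e-3) * (0.3 * 1.0)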

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R ._.

Figure: the corresponding lattice: ^ followed by candidate tags (N or V for 'People', V or N for 'Jump', A or N for 'High'), ending at '.'; lexical probabilities label the word emissions and bigram probabilities label the tag-to-tag transitions.

This model is called a Generative model. Here words are observed from tags as states. This is similar to an HMM.

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi - Unigram, Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix
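A minimal sketch of the evaluation steps (per-POS accuracy and confusion matrix), assuming gold and predicted tag lists are already available; scikit-learn is used only for illustration and is not prescribed by the slides.

from collections import Counter
from sklearn.metrics import confusion_matrix

gold = ['DT', 'NNS', 'WDT', 'VBP', 'JJ', 'NN']   # reference tags (toy)
pred = ['DT', 'NNS', 'WDT', 'VBP', 'JJ', 'VB']   # tagger output (toy)

# Per-POS accuracy: fraction of tokens of each gold tag that were tagged correctly
correct, total = Counter(), Counter()
for g, p in zip(gold, pred):
    total[g] += 1
    correct[g] += (g == p)
per_pos_acc = {tag: correct[tag] / total[tag] for tag in total}
print(per_pos_acc)

# Confusion matrix over the tag set
labels = sorted(set(gold) | set(pred))
print(labels)
print(confusion_matrix(gold, pred, labels=labels))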

21 July 2014Pushpak Bhattacharyya Intro

POS 68

Figure: Per POS Accuracy for Bigram Assumption - bar chart of per-tag accuracy (scale 0 to 1.2) over BNC-style tags such as AJ0, AJC, AJS, AT0, AV0, AVP, AVQ, CJS, CJT, CRD, DT0, DTQ, NN1, NN2, NP0, PNI, PNP, PNX, PRP, TO0, VBB, VBG, VBN, VDB, VDG, VDN, VHB, VHG, VHN, VM0, VVB, VVD, VVG, VVN, VVZ and their ambiguity-class variants (e.g. AJ0-NN1, AJ0-VVG).

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

           AJ0   AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN  AJC  AJS  AT0   AV0   AV0-AJ0  AVP
AJ0        2899  20       32       1        3        3        0    0    18    35    27       1
AJ0-AV0    31    18       2        0        0        0        0    0    0     1     15       0
AJ0-NN1    161   0        116      0        0        0        0    0    0     0     1        0
AJ0-VVD    7     0        0        0        0        0        0    0    0     0     0        0
AJ0-VVG    8     0        0        0        2        0        0    0    1     0     0        0
AJ0-VVN    8     0        0        3        0        2        0    0    1     0     0        0
AJC        2     0        0        0        0        0        69   0    0     11    0        0
AJS        6     0        0        0        0        0        0    38   0     2     0        0
AT0        192   0        0        0        0        0        0    0    7000  13    0        0
AV0        120   8        2        0        0        0        15   2    24    2444  29       11
AV0-AJ0    10    7        0        0        0        0        0    0    0     16    33       0
AVP        24    0        0        0        0        0        0    0    1     11    0        737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Figure: the NLP Trinity again - Problems (Morph Analysis, Part of Speech Tagging, Parsing, Semantics), Algorithms (HMM, MEMM, CRF), Languages (Hindi, Marathi, English, French).

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1: # of Red = 30, # of Green = 50, # of Blue = 20
Urn 2: # of Red = 10, # of Green = 40, # of Blue = 50
Urn 3: # of Red = 60, # of Green = 10, # of Blue = 30

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

Given:

Transition probabilities:
       U1   U2   U3
U1     0.1  0.4  0.5
U2     0.6  0.2  0.2
U3     0.3  0.4  0.3

and emission probabilities:
       R    G    B
U1     0.3  0.5  0.2
U2     0.1  0.4  0.5
U3     0.6  0.1  0.3

Observation: RRGGBRGR

State Sequence: ?? (not so easily computable)

21 July 2014Pushpak Bhattacharyya Intro

POS 73

(Transition probability table and emission probability table shown above.)
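The same tables written out as Python dictionaries, so that the later Viterbi and forward sketches can refer to them; this is only a transcription of the slide's numbers, not additional data.

# Urn HMM from the slides: 3 states (urns), 3 observation symbols (colours)
states  = ['U1', 'U2', 'U3']
symbols = ['R', 'G', 'B']

# Transition probabilities A[i][j] = P(next state j | current state i)
A = {'U1': {'U1': 0.1, 'U2': 0.4, 'U3': 0.5},
     'U2': {'U1': 0.6, 'U2': 0.2, 'U3': 0.2},
     'U3': {'U1': 0.3, 'U2': 0.4, 'U3': 0.3}}

# Emission probabilities B[i][o] = P(observation o | state i)
B = {'U1': {'R': 0.3, 'G': 0.5, 'B': 0.2},
     'U2': {'R': 0.1, 'G': 0.4, 'B': 0.5},
     'U3': {'R': 0.6, 'G': 0.1, 'B': 0.3}}

observation = list('RRGGBRGR')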

Diagrammatic representation (12)

Figure: state-transition diagram over U1, U2, U3; the arcs carry the transition probabilities from the table above (0.1, 0.4, 0.5, 0.6, 0.2, 0.2, 0.3, 0.4, 0.3) and each state carries its emission distribution (U1: R 0.3, G 0.5, B 0.2; U2: R 0.1, G 0.4, B 0.5; U3: R 0.6, G 0.1, B 0.3).

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

Figure: the same three-state diagram, with each arc now labelled by combined (transition x emission) probabilities for R, G and B, e.g. values such as R 0.02, G 0.08, B 0.10 and R 0.24, G 0.04, B 0.12 on the arcs.

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM
1. Given the observation sequence, find the possible state sequences - Viterbi
2. Given the observation sequence, find its probability - forward/backward algorithm
3. Given the observation sequence, find the HMM parameters - Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi
The "start" and "end" are important in a sequence. Subtrees get eliminated due to the Markov Assumption.

POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), . (end) [start & end states]

Lexicon:
people: N, V
laugh: N, V

Corpora for Training:
^ w11_t11 w12_t12 w13_t13 …… w1k_1_t1k_1 .
^ w21_t21 w22_t22 w23_t23 …… w2k_2_t2k_2 .
^ wn1_tn1 wn2_tn2 wn3_tn3 …… wnk_n_tnk_n .

Inference

Figure: partial sequence graph - ^ expanding to the candidate tags N, V for the first words.

Transition probability table (it will change from language to language due to language divergences):
       ^    N    V    O    .
^      0    0.6  0.2  0.2  0
N      0    0.1  0.4  0.3  0.2
V      0    0.3  0.1  0.3  0.3
O      0    0.3  0.2  0.3  0.2
.      1    0    0    0    0

Lexical Probability Table
Size of this table = #POS tags in tagset x vocabulary size, where vocabulary size = #unique words in the corpus

       Є    people   laugh    …
^      1    0        0        0
N      0    1x10-3   1x10-5   …
V      0    1x10-6   1x10-3   …
O      0    0        1x10-9   …
.      1    0        0        0

Inference
New Sentence: ^ people laugh .

p( ^ N N . | ^ people laugh . ) = (0.6 x 0.1) x (0.1 x 1x10-3) x (0.2 x 1x10-5)

Figure: partial lattice ^ -> {N, V} -> {N, V} -> . over ε, people, laugh.

Computational Complexity

If we have to get the probability of each sequence and then find the maximum among them, we would run into an exponential number of computations.

If |s| = #states (tags + ^ + .) and |o| = length of sentence (#words + ^ + .), then #sequences = |s|^(|o|-2)

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming

Figure: tree expansion for '^ people laugh .' (nodes labelled N1, V2, O3, N4, N5, …), showing scores such as
0.6 x 1.0 = 0.6 at the first level, and on 'people':
1. 0.6 x 0.1 x 10-3 = 6 x 10-5
2. 0.6 x 0.4 x 10-3 = 2.4 x 10-4
3. 0.6 x 0.3 x 10-3 = 1.8 x 10-4
4. 0.6 x 0.2 x 10-3 = 1.2 x 10-4

No need to expand N4 and N5 because they will never be a part of the winning sequence.

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.
Now complexity reduces from |s|^|o| to |s| x |o|.

Here we followed the Markov assumption of order 1
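For contrast, a minimal brute-force sketch of the exponential computation that the dynamic programme avoids: it enumerates every tag sequence for '^ people laugh .' using the toy numbers transcribed from the transition and lexical tables above, and keeps the maximum. A dynamic-programming (Viterbi) sketch appears later alongside the formal algorithm.

from itertools import product

tags = ['N', 'V', 'O']
# Toy numbers transcribed from the slides' transition and lexical tables
trans = {('^','N'): 0.6, ('^','V'): 0.2, ('^','O'): 0.2,
         ('N','N'): 0.1, ('N','V'): 0.4, ('N','O'): 0.3, ('N','.'): 0.2,
         ('V','N'): 0.3, ('V','V'): 0.1, ('V','O'): 0.3, ('V','.'): 0.3,
         ('O','N'): 0.3, ('O','V'): 0.2, ('O','O'): 0.3, ('O','.'): 0.2}
lex = {('people','N'): 1e-3, ('people','V'): 1e-6, ('people','O'): 0.0,
       ('laugh','N'): 1e-5, ('laugh','V'): 1e-3, ('laugh','O'): 1e-9}

words = ['people', 'laugh']
best = (0.0, None)
for seq in product(tags, repeat=len(words)):     # |tags|^|words| sequences
    p, prev = 1.0, '^'
    for w, t in zip(words, seq):
        p *= trans[(prev, t)] * lex[(w, t)]
        prev = t
    p *= trans[(prev, '.')]                       # close with the end state '.'
    if p > best[0]:
        best = (p, seq)
print(best)   # winning tag sequence and its probability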

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state. Keep advancing sequences that are "maximum" amongst all those ending in the same state.

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm

Figure: tree for the sentence "^ People laugh .": ^ branches on Ԑ to N, V, O with scores (0.6), (0.2), (0.2); on 'People' the nodes branch again with scores (0.06x10^-3), (0.24x10^-3), (0.18x10^-3), (0.06x10^-6), (0.02x10^-6), (0.06x10^-6) and (0), (0), (0).

Claim: We do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass
Will a word always be given the same tag? No. Consider the example:
^ people the city with soldiers . (i.e. 'populate')
^ quickly people the city .
In the first sentence "people" is most likely to be tagged as a noun, whereas in the second, probability mass will shift and "people" will be tagged as a verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon
Long tail Phenomenon: probability is very low but not zero over a large observed sequence.
Language Phenomenon: "people", which is predominantly tagged as "Noun", displays a long tail phenomenon; "laugh" is predominantly tagged as "Verb".

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

Figure: two nodes N1 (score 6x10^-5) and N2 (score 6x10^-8), each branching to N, V, O on 'LAUGH'.

Next step: all the probabilities will be multiplied by identical probabilities (lexical and transition). So children of N2 will have probability less than the children of N1.

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B) = P(B|A) if P(A) = P(B)

P(A|B) means: Causality (B causes A), Sequentiality (A follows B)

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here:
S = {U1, U2, U3}; V = {R, G, B}
For observation O = o1 … on
and state sequence Q = q1 … qn

π is the initial state distribution: πi = P(q1 = Ui)

A = transition matrix:
       U1   U2   U3
U1     0.1  0.4  0.5
U2     0.6  0.2  0.2
U3     0.3  0.4  0.3

B = emission matrix:
       R    G    B
U1     0.3  0.5  0.2
U2     0.1  0.4  0.5
U3     0.6  0.1  0.3

92

Observations and states

       O1 O2 O3 O4 O5 O6 O7 O8
OBS:   R  R  G  G  B  R  G  R
State: S1 S2 S3 S4 S5 S6 S7 S8

Si = U1, U2 or U3; a particular state
S: state sequence
O: observation sequence
S* = "best" possible state (urn) sequence
Goal: maximize P(S*|O) by choosing the "best" S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O), where S is the State Sequence and O is the Observation Sequence:

S* = argmax_S P(S|O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

P(S|O) = P(S1-8 | O1-8)
       = P(S1|O) P(S2|S1,O) P(S3|S2,S1,O) … P(S8|S7,S6,…,S1,O)

By Markov Assumption (a state depends only on the previous state):
P(S|O) = P(S1|O) P(S2|S1,O) P(S3|S2,O) … P(S8|S7,O)

       O1 O2 O3 O4 O5 O6 O7 O8
OBS:   R  R  G  G  B  R  G  R
State: S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) = P(A) P(B|A) / P(B)
P(A): Prior
P(B|A): Likelihood

argmax_S P(S|O) = argmax_S P(S) P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

P(S) = P(S1-8)
     = P(S1) P(S2|S1) P(S3|S2,S1) P(S4|S3,S2,S1) … P(S8|S7,S6,…,S1)

By Markov Assumption (k=1):
P(S) = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) … P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

P(O|S) = P(O1|S1-8) P(O2|O1,S1-8) P(O3|O1-2,S1-8) … P(O8|O1-7,S1-8)

Assumption: the ball drawn depends only on the Urn chosen:
P(O|S) = P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)

P(S) P(O|S) = [P(S1) P(S2|S1) P(S3|S2) P(S4|S3) … P(S8|S7)] x [P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)]

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e. P(S9|S8) = 1.
O0 is the ε-transition.

       O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:   ε  R  R  G  G  B  R  G  R
State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

Figure: chain of states S0 -> S1 -> S2 -> … -> S8 -> S9 with the emitted symbols ε R R G G B R G R on the arcs.

       O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:   ε  R  R  G  G  B  R  G  R
State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

Notation: P(Ok|Sk) P(Sk+1|Sk) = P(Sk -> Sk+1) with output Ok, written P(Sk -> Sk+1)_Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

Figure: two states S1 and S2; the arcs carry (symbol, probability) pairs such as (a1, 0.3), (a2, 0.4), (a1, 0.2), (a2, 0.3), (a1, 0.1), (a2, 0.2), (a1, 0.3), (a2, 0.2).

The question here is: "what is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree

Start: P(S1) = 1.0, P(S2) = 0.0 (on €)
After the first symbol a1 (arc probabilities 0.1, 0.3, 0.2, 0.3):
  1.0 x 0.1 = 0.1, 1.0 x 0.3 = 0.3, 0.0, 0.0
After the second symbol a2 (arc probabilities 0.2, 0.4, 0.3, 0.2):
  0.1 x 0.2 = 0.02, 0.1 x 0.4 = 0.04, 0.3 x 0.3 = 0.09, 0.3 x 0.2 = 0.06

Choose the winning sequence per state, per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd…

From the winners 0.09 (S1) and 0.06 (S2), on a1 (arc probabilities 0.1, 0.3, 0.2, 0.3):
  0.09 x 0.1 = 0.009, 0.027, 0.012, 0.018
On a2, from the surviving nodes:
  S1 (arc 0.3): 0.0081, S2 (arc 0.2): 0.0054, S2 (arc 0.4): 0.0048, S1 (arc 0.2): 0.0024

The problem being addressed by this tree is S* = argmax_S { P(S | a1-a2-a1-a2, µ) }
where a1-a2-a1-a2 is the output sequence and µ the model or the machine.

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement: Find the best possible sequence
S* = argmax_S P(S | O, µ)
where S = state sequence, O = output sequence, µ = machine or model

Machine or model = (S, S0, A, T), with:
  S0 : start symbol (start state)
  S  : state collection
  A  : alphabet set
  T  : transitions, defined as P(Si --ak--> Sj)

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

            €     a1                      a2             a1               a2
S1          1.0   (1.0*0.1, 0.0*0.2)      (0.02, 0.09)   (0.009, 0.012)   (0.0024, 0.0081)
                  = (0.1, 0.0)
S2          0.0   (1.0*0.3, 0.0*0.3)      (0.04, 0.06)   (0.027, 0.018)   (0.0048, 0.0054)
                  = (0.3, 0.0)

Rows: ending state; columns: latest symbol observed. The final winner is in the last column.
Note: every cell records the winning probability ending in that state.
The bold-faced values in each cell show the sequence probability ending in that state. Going backward from the final winner sequence, which ends in state S2 (indicated by the 2nd tuple), we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm (following James Allen, Natural Language Understanding (2nd edition), Benjamin/Cummings (pub.), 1995)

Given:
1. The HMM, which means:
   a. Start State: S1
   b. Alphabet: A = {a1, a2, …, ap}
   c. Set of States: S = {S1, S2, …, Sn}
   d. Transition probability P(Si --ak--> Sj), which is equal to P(Sj, ak | Si)
2. The output string: a1 a2 … aT

To find: The most likely sequence of states C1 C2 … CT which produces the given output sequence, i.e.
C1 C2 … CT = argmax_C [ P(C | a1 a2 … aT) ]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structures:
1. An N x T array called SEQSCORE to maintain the winner sequence always (N = #states, T = length of output sequence)
2. Another N x T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For i = 2 to N do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For t = 2 to T do
    For i = 1 to N do
        SEQSCORE(i,t) = Max(j=1..N) [ SEQSCORE(j,(t-1)) * P(Sj --ak--> Si) ]
        BACKPTR(i,t) = index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = i that maximizes SEQSCORE(i,T)
For i from (T-1) to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1 x T
2. SEQSCORE can be T x 2

Homework: compare this with A*, Beam Search. Reason for this comparison: both of them work for finding and recovering sequences.
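A minimal runnable sketch of these three steps. The HMM here is the two-state probabilistic FSM from the earlier slides (S1, S2 with outputs a1, a2); the arc probabilities are inferred from the tree/table values above, so treat them as an assumption to the extent the figure is ambiguous. Following the optimization note, only the previous column of SEQSCORE is kept.

# Viterbi over the two-state FSM; P[(Si, a)][Sj] = P(Si --a--> Sj)
states = ['S1', 'S2']
P = {('S1', 'a1'): {'S1': 0.1, 'S2': 0.3},
     ('S1', 'a2'): {'S1': 0.2, 'S2': 0.4},
     ('S2', 'a1'): {'S1': 0.2, 'S2': 0.3},
     ('S2', 'a2'): {'S1': 0.3, 'S2': 0.2}}
output = ['a1', 'a2', 'a1', 'a2']

# 1. Initialization: all probability mass in the start state S1
seqscore = {'S1': 1.0, 'S2': 0.0}
backptr = []                       # backptr[t][s] = best predecessor of s at step t

# 2. Iteration: per state, keep the best score and remember its predecessor
for a in output:
    cur, ptr = {}, {}
    for s in states:
        score, parent = max((seqscore[p] * P[(p, a)][s], p) for p in states)
        cur[s], ptr[s] = score, parent
    backptr.append(ptr)
    seqscore = cur

# 3. Sequence identification: start from the best final state, follow back-pointers
state = max(seqscore, key=seqscore.get)
path = [state]
for ptr in reversed(backptr):
    state = ptr[state]
    path.append(state)
path.reverse()
print(seqscore, path)   # best score 0.0081; path ['S1','S2','S1','S2','S1'], as on the earlier slide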

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

Figure: trellis rooted at S0, branching on ε to U1, U2, U3 and then on R to U1, U2, U3 under each; the figure shows values 0.5, 0.3, 0.2 at the first level, 0.03, 0.08, 0.15, 0.06, 0.02, 0.02, 0.18, 0.24, 0.18 on the expansions, and 0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036 at the leaves.

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory works:
P(S) P(O|S) = P(O0|S0) P(S1|S0)
  [P(O1|S1) P(S2|S1,S0)] [P(O2|S2) P(S3|S2,S1)] [P(O3|S3) P(S4|S3,S2)] [P(O4|S4) P(S5|S4,S3)]
  [P(O5|S5) P(S6|S5,S4)] [P(O6|S6) P(S7|S6,S5)] [P(O7|S7) P(S8|S7,S6)] [P(O8|S8) P(S9|S8,S7)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e. P(S9|S8,S7) = 1.
O0 is the ε-transition.

       O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:   ε  R  R  G  G  B  R  G  R
State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence? The language modeling problem.

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   - Less probable because of grammatical mistake
3. P("The svn rises in the east")
   - Less probable because of lexical mistake
4. P("The sun rises in the west")
   - Less probable because of semantic mistake

Probabilities computed in the context of corpora.

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model:
1. Detect well-formedness
   - Lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   - Given a piece of text, what language does it belong to?
     Good morning - English
     Guten Morgen - German
     Bon jour - French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0, o1, o2, o3, …, om)?

P(O) = Σ_S P(O, S)     (Marginalization)

Consider the observation sequence
O = O0 O1 O2 … Om
S = S0 S1 S2 S3 … Sm Sm+1
where the Si's represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0, o1, o2, o3, …, om)

P(O, S) = P(S) P(O|S)
        = P(S0 S1 … Sm+1) P(O0 O1 … Om | S0 S1 … Sm+1)
        = [P(S0) P(S1|S0) P(S2|S1) … P(Sm+1|Sm)] x [P(O0|S0) P(O1|S1) … P(Om|Sm)]
        = P(S0) [P(O0|S0) P(S1|S0)] [P(O1|S1) P(S2|S1)] … [P(Om|Sm) P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k, i)

Define F(k, i) = probability of being in state Si having seen o0 o1 o2 … ok
F(k, i) = P(o0 o1 o2 … ok, Si)
With m as the length of the observed sequence and N states:
P(observed sequence) = P(o0 o1 o2 … om) = Σ(p=0..N) P(o0 o1 o2 … om, Sp) = Σ(p=0..N) F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd.)
F(k, q) = P(o0 o1 o2 … ok, Sq)
        = P(o0 o1 o2 … ok-1, ok, Sq)
        = Σ(p=0..N) P(o0 o1 o2 … ok-1, Sp, ok, Sq)
        = Σ(p=0..N) P(o0 o1 o2 … ok-1, Sp) P(ok, Sq | o0 o1 o2 … ok-1, Sp)
        = Σ(p=0..N) F(k-1, p) P(ok, Sq | Sp)
        = Σ(p=0..N) F(k-1, p) P(Sp -> Sq)_ok

O0 O1 O2 O3 … Ok Ok+1 … Om-1 Om
S0 S1 S2 S3 … Sp Sq  … Sm Sfinal
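A minimal sketch of this recurrence on the urn HMM defined earlier. It uses the common textbook convention in which the symbol is emitted by the state being entered, and a uniform initial distribution π is an assumption (the slides do not fix π numerically); it is meant to illustrate the computation, not the slides' exact boundary conventions.

# Forward algorithm: F[q] = P(o0..ok, state q), built left to right.
states = ['U1', 'U2', 'U3']
A = {'U1': {'U1': 0.1, 'U2': 0.4, 'U3': 0.5},
     'U2': {'U1': 0.6, 'U2': 0.2, 'U3': 0.2},
     'U3': {'U1': 0.3, 'U2': 0.4, 'U3': 0.3}}
B = {'U1': {'R': 0.3, 'G': 0.5, 'B': 0.2},
     'U2': {'R': 0.1, 'G': 0.4, 'B': 0.5},
     'U3': {'R': 0.6, 'G': 0.1, 'B': 0.3}}
pi = {s: 1.0 / 3 for s in states}    # assumed uniform initial distribution

def forward(obs):
    F = {q: pi[q] * B[q][obs[0]] for q in states}                      # F(0, q)
    for o in obs[1:]:                                                  # F(k, q) from F(k-1, p)
        F = {q: sum(F[p] * A[p][q] for p in states) * B[q][o] for q in states}
    return sum(F.values())                                             # P(observation sequence)

print(forward(list('RRGGBRGR')))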

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(k, i)

Define B(k, i) = probability of seeing ok ok+1 ok+2 … om given that the state was Si
B(k, i) = P(ok ok+1 ok+2 … om | Si)
With m as the length of the whole observed sequence:
P(observed sequence) = P(o0 o1 o2 … om) = P(o0 o1 o2 … om | S0) = B(0, 0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd.)
B(k, p) = P(ok ok+1 ok+2 … om | Sp)
        = P(ok+1 ok+2 … om, ok | Sp)
        = Σ(q=0..N) P(ok+1 ok+2 … om, ok, Sq | Sp)
        = Σ(q=0..N) P(ok, Sq | Sp) P(ok+1 ok+2 … om | ok, Sq, Sp)
        = Σ(q=0..N) P(ok+1 ok+2 … om | Sq) P(ok, Sq | Sp)
        = Σ(q=0..N) B(k+1, q) P(Sp -> Sq)_ok

O0 O1 O2 O3 … Ok Ok+1 … Om-1 Om
S0 S1 S2 S3 … Sp Sq  … Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probabilities:
       ^    N    V    .
^      0    0.7  0.3  0
N      0    0.2  0.6  0.2
V      0    0.6  0.2  0.2
.      1    0    0    0

Lexical probabilities:
       ε    People  Laugh
^      1    0       0
N      0    0.8     0.2
V      0    0.1     0.9
.      1    0       0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all state sequences S of [ Π_i P(Oi | Si) P(Si -> Si+1) ]
(every path of the tree below is scored separately, so the work grows exponentially with the sentence length)

Computation in various paths of the Tree

             ε    People   Laugh
Path 1:      ^    N        N       .
P(Path1) = (1.0 x 0.7) x (0.8 x 0.2) x (0.2 x 0.2)

Path 2:      ^    N        V       .
P(Path2) = (1.0 x 0.7) x (0.8 x 0.6) x (0.9 x 0.2)

Path 3:      ^    V        N       .
P(Path3) = (1.0 x 0.3) x (0.1 x 0.6) x (0.2 x 0.2)

Path 4:      ^    V        V       .
P(Path4) = (1.0 x 0.3) x (0.1 x 0.2) x (0.9 x 0.2)

Figure: the four paths drawn as a tree from ^ over ε, People, Laugh.

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis
F = accumulated F x output probability x transition probability

F(N,1) = 0.7 x 1.0
F(V,1) = 0.3 x 1.0
F(N,2) = F(N,1) x (0.2 x 0.8) + F(V,1) x (0.6 x 0.1)
F(V,2) = F(N,1) x (0.6 x 0.8) + F(V,1) x (0.2 x 0.1)
F(.,3) = F(N,2) x (0.2 x 0.2) + F(V,2) x (0.2 x 0.9)

Figure: trellis ^ -> {N, V} -> {N, V} -> . over ε, People, Laugh.
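A quick numeric check of these trellis values against the four explicit paths; all numbers are transcribed from the transition and lexical tables above, so any discrepancy would point at a transcription error rather than at the method.

# Trellis values for '^ People Laugh .' with the tables above
F_N1 = 0.7 * 1.0
F_V1 = 0.3 * 1.0
F_N2 = F_N1 * (0.2 * 0.8) + F_V1 * (0.6 * 0.1)
F_V2 = F_N1 * (0.6 * 0.8) + F_V1 * (0.2 * 0.1)
F_end = F_N2 * (0.2 * 0.2) + F_V2 * (0.2 * 0.9)

# The same quantity, path by path (Paths 1-4 above)
paths = [(1.0*0.7) * (0.8*0.2) * (0.2*0.2),
         (1.0*0.7) * (0.8*0.6) * (0.9*0.2),
         (1.0*0.3) * (0.1*0.6) * (0.2*0.2),
         (1.0*0.3) * (0.1*0.2) * (0.9*0.2)]
print(F_end, sum(paths))   # both give P(O); the trellis reuses the partial products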

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of Multiplications

Tree: each path has 5 multiplications + 1 addition; there are 4 paths in the tree; therefore a total of 20 multiplications and 3 additions.

Trellis:
F(N,1) -> 1 multiplication
F(V,1) -> 1 multiplication
F(N,2) = F(N,1) x (1 mult) + F(V,1) x (1 mult) = 4 multiplications + 1 addition
Similarly for F(V,2) and F(.,3): 4 multiplications and 1 addition each
So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity
Let |S| = #states and |O| = observation length (excluding the boundary symbols ^ and .).
- Stage 1 of trellis: |S| multiplications
- Stage 2 of trellis: |S| nodes; each node needs computation over |S| arcs; each arc = 1 multiplication, accumulated F = 1 more multiplication; total 2|S|^2 multiplications
- Same for each stage before reading '.'
- At the final stage ('.'): 2|S| multiplications
- Therefore total multiplications = |S| + 2|S|^2 (|O| - 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis
2. Take the sum of F values multiplied by P(Si -> Sj)
3. Complexity = |S| + 2|S|^2 (|O| - 1) + 2|S|
             = 2|S|^2 |O| - 2|S|^2 + 3|S|
             = O(|S|^2 |O|)
i.e. linear in the length of input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble: auto-completion of words (human vs. machine)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131



How are two entities matched

bull Entity A and Entity BndashMatch(AB)

ndashTwo entities match iff their parts matchbull Match(Parts(A) Parts(B))

ndashTwo entities match iff their properties matchbull Match(Properties(A) Properties(B))

bull Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding James Allan Speech and NLP Jurafsky and Martin Foundations of Statistical NLP Manning and Schutze

Other References Statistical NLP Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied DisciplinesPhilosophy Semantics Meaning of ldquomeaningrdquo Logic

(syllogism)Linguistics Study of Syntax Lexicon Lexical Semantics etc

Probability and Statistics Corpus Linguistics Testing of Hypotheses System Evaluation

Cognitive Science Computational Models of Language Processing Language Acquisition

Psychology Behavioristic insights into Language Processing Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory Entropy Random Fields

Computer Sc amp Engg Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

infection declension derivation etc ) Rule based Vs Statistical NLU comparison with POS tagging as case study Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

ldquo_ldquo The_DT mechanisms_NNS that_WDTmake_VBP traditional_JJ hardware_NNare_VBP really_RB being_VBGobsoleted_VBN by_IN microprocessor-based_JJ machines_NNS _ rdquo_rdquo said_VBDMr_NNP Benton_NNP _

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence? The language modelling problem.

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   • Less probable because of grammatical mistake
3. P("The svn rises in the east")
   • Less probable because of lexical mistake
4. P("The sun rises in the west")
   • Less probable because of semantic mistake

Probabilities are computed in the context of corpora.

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of a language model:
1. Detect well-formedness
   • lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   • Given a piece of text, what language does it belong to?
     Good morning  - English
     Guten Morgen  - German
     Bonjour       - French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0, o1, o2, o3, …, om)?

    P(O) = Σ_S P(O, S)        [Marginalization]

Consider the observation sequence

    O0  O1  O2  …  Om
    S0  S1  S2  …  Sm  Sm+1

where the Si's represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0, o1, o2, o3, …, om)

P(O, S) = P(S) · P(O | S)

P(S0, S1, …, Sm+1) · P(O0, O1, …, Om | S0, S1, …, Sm+1)
    = [ P(S0) · P(S1|S0) · P(S2|S1) · … · P(Sm+1|Sm) ] · [ P(O0|S0) · P(O1|S1) · … · P(Om|Sm) ]
    = P(S0) · [P(O0|S0)·P(S1|S0)] · [P(O1|S1)·P(S2|S1)] · … · [P(Om|Sm)·P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k, i)

Define F(k, i) = probability of being in state Si having seen o0 o1 o2 … ok:

    F(k, i) = P(o0, o1, o2, …, ok, Si)

With m as the length of the observed sequence and N states,

    P(observed sequence) = P(o0, o1, o2, …, om)
                         = Σ_{p=0..N} P(o0, o1, o2, …, om, Sp)
                         = Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd.)

F(k, q) = P(o0, o1, …, ok, Sq)
        = P(o0, o1, …, ok-1, ok, Sq)
        = Σ_{p=0..N} P(o0, o1, …, ok-1, Sp, ok, Sq)
        = Σ_{p=0..N} P(o0, o1, …, ok-1, Sp) · P(ok, Sq | o0, o1, …, ok-1, Sp)
        = Σ_{p=0..N} F(k-1, p) · P(ok, Sq | Sp)
        = Σ_{p=0..N} F(k-1, p) · P(Sp --ok--> Sq)

    O0  O1  O2  O3  …  Ok  Ok+1  …  Om-1  Om
    S0  S1  S2  S3  …  Sp  Sq    …  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(k, i)

Define B(k, i) = probability of seeing ok ok+1 ok+2 … om given that the state was Si:

    B(k, i) = P(ok, ok+1, ok+2, …, om | Si)

With m as the length of the whole observed sequence,

    P(observed sequence) = P(o0, o1, o2, …, om)
                         = P(o0, o1, o2, …, om | S0)
                         = B(0, 0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd.)

B(k, p) = P(ok, ok+1, ok+2, …, om | Sp)
        = P(ok+1, ok+2, …, om, ok | Sp)
        = Σ_{q=0..N} P(ok+1, ok+2, …, om, ok, Sq | Sp)
        = Σ_{q=0..N} P(ok, Sq | Sp) · P(ok+1, ok+2, …, om | ok, Sq, Sp)
        = Σ_{q=0..N} P(ok+1, ok+2, …, om | Sq) · P(ok, Sq | Sp)
        = Σ_{q=0..N} B(k+1, q) · P(Sp --ok--> Sq)

    O0  O1  O2  O3  …  Ok  Ok+1  …  Om-1  Om
    S0  S1  S2  S3  …  Sp  Sq    …  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of forward probability: to find P(O), the probability of the observation sequence.

E.g. ^ People laugh .

[Trellis: ^ at the start, then {N, V} for 'People', {N, V} for 'laugh', and '.' at the end.]

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probabilities P(next tag | tag):

         ^     N     V     .
    ^    0    0.7   0.3    0
    N    0    0.2   0.6   0.2
    V    0    0.6   0.2   0.2
    .    1     0     0     0

Lexical (emission) probabilities P(word | tag):

         ε    People  Laugh
    ^    1      0       0
    N    0     0.8     0.2
    V    0     0.1     0.9
    .    1      0       0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

    P(O) = Σ over all state sequences of  Π over arcs  P(S_i --o_j--> S_j)
         = Σ over all state sequences of  Π over arcs  [ P(o_j | S_i) · P(S_j | S_i) ]

(i.e., sum the complete path probability over every path of the tree — exponentially many terms)

Computation in the various paths of the tree (emissions ε, People, Laugh):

Path 1: ^ N N .
    P(Path1) = (1.0×0.7) × (0.8×0.2) × (0.2×0.2)

Path 2: ^ N V .
    P(Path2) = (1.0×0.7) × (0.8×0.6) × (0.9×0.2)

Path 3: ^ V N .
    P(Path3) = (1.0×0.3) × (0.1×0.6) × (0.2×0.2)

Path 4: ^ V V .
    P(Path4) = (1.0×0.3) × (0.1×0.2) × (0.9×0.2)

[Tree: ^ emits ε and branches to N and V; each of these emits 'People' and branches again to N and V; each of those emits 'Laugh' and closes at '.'.]
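(These four path probabilities are 0.00448, 0.06048, 0.00072 and 0.00108; their sum, P(O) ≈ 0.0668, is exactly what the trellis computation on the next slide reaches with far fewer multiplications.)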

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis

F = accumulated F × output probability × transition probability

    F1(N) = 0.7 × 1.0
    F1(V) = 0.3 × 1.0
    F2(N) = F1(N) × (0.8 × 0.2) + F1(V) × (0.1 × 0.6)
    F2(V) = F1(N) × (0.8 × 0.6) + F1(V) × (0.1 × 0.2)
    F3(.) = F2(N) × (0.2 × 0.2) + F2(V) × (0.9 × 0.2)

(Each bracket is the output probability at the source node times the transition probability to the next node; the trellis runs ^ → {N, V} → {N, V} → '.' over the emissions ε, People, Laugh.)
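To make the trellis computation concrete, here is a small Python sketch. It is illustrative rather than part of the course material: the function name forward_trellis and the dictionary layout are my own, and the numbers are simply the transition and lexical tables above. F is accumulated stage by stage exactly as in the recurrence, and the value reaching '.' is P(O) for "^ People laugh .".

trans = {                               # P(next tag | tag), from the transition table above
    '^': {'N': 0.7, 'V': 0.3},
    'N': {'N': 0.2, 'V': 0.6, '.': 0.2},
    'V': {'N': 0.6, 'V': 0.2, '.': 0.2},
}
emit = {                                # P(word | tag), from the lexical table above
    '^': {'ε': 1.0},
    'N': {'People': 0.8, 'Laugh': 0.2},
    'V': {'People': 0.1, 'Laugh': 0.9},
}

def forward_trellis(words, tags=('N', 'V')):
    # Accumulate F over each stage: F_next(q) = sum_p F(p) * P(word | p) * P(q | p).
    F = {'^': 1.0}                      # stage 0: in '^' with probability 1
    emitted = ['ε'] + list(words)       # symbol emitted at each source node
    for i, word in enumerate(emitted):
        targets = tags if i < len(emitted) - 1 else ('.',)   # the last arc goes to '.'
        F = {q: sum(F[p] * emit[p].get(word, 0.0) * trans[p].get(q, 0.0) for p in F)
             for q in targets}
    return F['.']                       # P(O), accumulated at the end state

print(forward_trellis(['People', 'Laugh']))     # ≈ 0.0668 with the tables above

Only |S| running F values are kept per stage, which is exactly where the saving over enumerating all the paths comes from.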

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of Multiplications

Tree: each path needs 5 multiplications; there are 4 paths in the tree, and summing them takes 3 additions. Therefore a total of 20 multiplications and 3 additions.

Trellis:
    F1(N) → 1 multiplication
    F1(V) → 1 multiplication
    F2(N) = F1(N) × (1 mult) + F1(V) × (1 mult) → 4 multiplications + 1 addition
    Similarly for F2(V) and F3(.): 4 multiplications and 1 addition each.

So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = number of states and |O| = observation length (not counting '^' and '.').

• Stage 1 of the trellis: |S| multiplications.
• Stage 2 of the trellis: |S| nodes, each needing computation over |S| incoming arcs. Each arc = 1 multiplication; accumulating into F = 1 more multiplication. Total: 2|S|² multiplications.
• The same holds for each stage before reading '.'.
• At the final stage ('.'): 2|S| multiplications.

Therefore total multiplications = |S| + 2|S|²·(|O| − 1) + 2|S|.
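(Sanity check against the worked example: |S| = 2 and |O| = 2 give 2 + 2·2²·(2 − 1) + 2·2 = 14 multiplications, matching the count above.)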

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values multiplied by P(S_p --o--> S_q).
3. Complexity = |S| + 2|S|²·(|O| − 1) + 2|S|
              = 2|S|²·|O| − 2|S|² + 3|S|
              = O(|S|²·|O|),

i.e., linear in the length of the input and quadratic in the number of states.
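(The Viterbi algorithm of the earlier slides is the same trellis computation with max in place of sum, plus a back-pointer to recover the winning path, so it has the same O(|S|²·|O|) cost.)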

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward probability:
   a) Derive the backward algorithm.
   b) Compute its complexity.
2. Express P(O) in terms of both the forward and the backward probability.

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding):

• Scrabble: auto-completion of words (human vs. machine)
• Humour detection using WordNet (incongruity theory)
• Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List

• TnT (http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf)
• Brill Tagger (http://delivery.acm.org/10.1145/1080000/1075553/p112-brill.pdf)
• Hindi POS Tagger built by IIT Bombay (http://www.cse.iitb.ac.in/~pb/papers/ACL-2006-Hindi-POS-Tagging.pdf)
• Projection (http://www.dipanjandas.com/files/posInduction.pdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131




Some general observationsA= argmax [P(A|B)]

A= argmax [P(A)P(B|A)]

AComputing and using P(A) and P(B|A) both need

(i) looking at the internal structures of A and B(ii) making independence assumptions(iii) putting together a computation from smaller parts

21 July 2014Pushpak Bhattacharyya Intro

POS 5

Corpus

A collection of text called corpus is used for collecting various language data

With annotation more information but manual labor intensive

Practice label automatically correct manually The famous Brown Corpus contains 1 million tagged words Switchboard very famous corpora 2400 conversations

543 speakers many US dialects annotated with orthography and phonetics

21 July 2014Pushpak Bhattacharyya Intro

POS 6

What is NLP

Branch of AI 2 Goals

Science Goal Understand the way language operates

Engineering Goal Build systems that analyse and generate language reduce the man machine gap

21 July 2014Pushpak Bhattacharyya Intro

POS 7

Perpectivising NLP Areas of AI and their inter-dependencies

Search

Vision

PlanningMachine Learning

Knowledge RepresentationLogic

Expert SystemsRoboticsNLP

21 July 2014Pushpak Bhattacharyya Intro

POS 8

NLP Two pictures

NLP

Vision Speech

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Statistics and Probability

+Knowledge Based

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 9

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 10

A famous sentence (12)

ldquoBuffalo buffaloes Buffalo buffaloes buffalo buffalo Buffalo buffaloes Buffalo buffaloes buffalo

21 July 2014Pushpak Bhattacharyya Intro

POS 11

A famous sentence (22)

ldquoBuffalo buffaloes Buffalo buffaloes buffalo buffalo Buffalo buffaloes Buffalo buffaloes buffalo

BuffaloAnimalCitybully

21 July 2014Pushpak Bhattacharyya Intro

POS 12

NLP multilayered multidimensional

Morphology

POS tagging

Chunking

Parsing

Semantics

Discourse and Coreference

IncreasedComplexity OfProcessing

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

Multilinguality Indian situation Major streams

Indo European Dravidian Sino Tibetan Austro-Asiatic

Some languages are ranked within 20 in the world in terms of the populations speaking them Hindi and Urdu 5th (~500

milion) Bangla 7th (~300 million) Marathi 14th (~70 million)

NLP architecture and stages of processing- ambiguity at every stage

Phonetics and phonology Morphology Lexical Analysis Syntactic Analysis Semantic Analysis Pragmatics Discourse

21 July 2014Pushpak Bhattacharyya Intro

POS 15

Phonetics processing of speech sound and associated challenges

Homophones bank (finance) vs bank (river bank) Near Homophones maatraa vs maatra (hin) Word Boundary

आजायग (aajaayenge) (aa jaayenge (will come) or aaj aayenge(will come today)

I got [ua]plate His research is in human languages

Disfluency ah um ahem etc

(near homophone trouble) The king of Abu Dhabi expired and there was national mourning for 7 days Some children were playing in the evening when a person chided them Do not play it is mourning time The children said No it is evening time and we will play

21 July 2014Pushpak Bhattacharyya Intro

POS 16

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 17

Morphology Word formation rules from root words Nouns Plural (boy-boys) Gender marking (czar-czarina) Verbs Tense (stretch-stretched) Aspect (eg perfective sit-had

sat) Modality (eg request khaanaa khaaiie) First crucial first step in NLP Languages rich in morphology eg Dravidian Hungarian

Turkish Languages poor in morphology Chinese English Languages with rich morphology have the advantage of easier

processing at higher stages of processing A task of interest to computer science Finite State Machines for

Word Morphology

21 July 2014Pushpak Bhattacharyya Intro

POS 18

Lexical Analysis

Dictionary and word properties

dognoun (lexical property)take-rsquosrsquo-in-plural (morph property)animate (semantic property)4-legged (-do-)carnivore (-do)

21 July 2014Pushpak Bhattacharyya Intro

POS 19

Lexical Disambiguationpart of Speech Disambiguation

Dog as a noun (animal) Dog as a verb (to pursue)

Sense Disambiguation Dog (as animal) Dog (as a very detestable person) The chair emphasised the need for adult education

Very common in day to day communicationsSatellite Channel Ad Watch what you want when you

want (two senses of watch)Ground breaking ceremonyresearch(ToI 14114) India eradicates polio says WHO

21 July 2014Pushpak Bhattacharyya Intro

POS 20

Technological developments bring in new terms additional meaningsnuances for existing terms

Justify as in justify the right margin (word processing context)

Xeroxed a new verb Digital Trace a new expression Communifaking pretending to talk on

mobile when you are actually not Discomgooglation anxietydiscomfort at

not being able to access internet Helicopter Parenting over parenting Obamagain Obama care modinomics

21 July 2014Pushpak Bhattacharyya Intro

POS 21

Ambiguity of Multiwords The grandfather kicked the bucket after suffering from cancer This job is a piece of cake Put the sweater on He is the dark horse of the match

Google Translations of above sentencesदादा कसर स पी ड़त होन क बाद बा ट लात मार इस काम क कक का एक टकड़ा हवटर पर रखोवह मच क अधर घोड़ा ह

21 July 2014Pushpak Bhattacharyya Intro

POS 22

Ambiguity of Named Entities

Bengali চ ল সরকার বািড়েত আেছEnglish Government is restless at home ()Chanchal Sarkar is at home

Amsterdam airport ldquoBaby Changing Roomrdquo

Hindi द नक दबग द नयाEnglish everyday bold worldActually name of a Hindi newspaper in Indore

High degree of overlap between NEs and MWEs

Treat differently - transliterate do not translate

21 July 2014Pushpak Bhattacharyya Intro

POS 23

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 24

StructureS

NPVP

V NP

Ilike mangoes

21 July 2014Pushpak Bhattacharyya Intro

POS25

Structural Ambiguity Scope

1The old men and women were taken to safe locations(old men and women) vs ((old men) and women)2 No smoking areas will allow Hookas inside

Preposition Phrase Attachment I saw the boy with a telescope

(who has the telescope) I saw the mountain with a telescope

(world knowledge mountain cannot be an instrument of seeing)

Very ubiquitous newspaper headline ldquo20 years later BMC pays father 20 lakhs for causing sonrsquos deathrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 26

Garden pathing

The only minus possibly was the need to face the audience more and more insightful question answer

The old man the boat

The horse raced past the garden fell

21 July 2014Pushpak Bhattacharyya Intro

POS 27

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 28

Semantic Analysis Representation in terms of

Predicate calculusSemantic NetsFramesConceptual Dependencies and Scripts

John gave a book to Mary Give action Agent John Object Book Recipient

Mary Challenge ambiguity in semantic role labeling

(Eng) Visiting aunts can be a nuisance (Hin) aapko mujhe mithaai khilaanii padegii

(ambiguous in Marathi and Bengali too not in Dravidian languages)

21 July 2014Pushpak Bhattacharyya Intro

POS 29

Coreference challenge

Binding of co-referring nouns and pronouns The monkey ate the banana because it

was hungry The monkey ate the banana because it

was ripe and sweet The monkey ate the banana because it

was lunch time

21 July 2014Pushpak Bhattacharyya Intro

POS 30

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 31

Pragmatics Very hard problem Model user intention

Tourist (in a hurry checking out of the hotel motioning to the service boy) Boy go upstairs and see if my sandals are under the divan Do not be late I just have 15 minutes to catch the train

Boy (running upstairs and coming back panting) yes sir they are there

World knowledge WHY INDIA NEEDS A SECOND OCTOBER (ToI

21007)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

DiscourseProcessing of sequence of sentences Mother to John

John go to school It is open today Should you bunk Father will be very angry

Ambiguity of openbunk whatWhy will the father be angry

Complex chain of reasoning and application of world knowledge Ambiguity of father

father as parent or

father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected ndash today was the math test

He couldnrsquot control the class

Teacher shouldnrsquot have made him responsible

After all he is just a janitor

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (12)1 Teacher (angrily) did you miss the class yesterday

Student not much

2 A man coming back to his parked car sees the sticker Parking fine He goes and thanks the policeman for appreciating his parking skill

3 John I got a Jaguar car for my unemployed youngest sonJack Thats a great exchange

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (22)A teacher-student exchangeTeacher What do you think is the capital of Ethiopia

Student What do you think

Teacher (angrily) I do not think ltpausegt I know

Student I do not think I know ltno pausegt21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps Segmentation (Word Boundary Detection) Identify the word

Isolated Word Recognition Identify W given SS (speech signal)

^arg max ( | )

WW P W SS

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood called ldquophonological model ldquo intuitively more tractable

P(W) = prior probability called ldquolanguage modelrdquo

^arg m ax ( | )

arg m ax ( ) ( | )W

W

W P W SS

P W P SS W

W appears in the corpus( ) words in the corpus

P W

21 July 2014Pushpak Bhattacharyya Intro

POS 39

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound Text ambiguity whether vs weather right vs write bought vs bot

Text Sound ambiguity read (present tense) vs read (past tense) lead (verb) vs lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W)= P(t o m ae t o |Word is ldquotomatordquo) = Product of arc probabilities

t

s4

o m o

ae

t

aa

end

s1 s2 s3s5

s6 s7

10 10 10 1010

10

073

027

Word

Pronunciation Automaton

Tomato

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs Discriminative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

• Entity A and Entity B
  – Match(A, B)?
  – Two entities match iff their parts match
    • Match(Parts(A), Parts(B))
  – Two entities match iff their properties match
    • Match(Properties(A), Properties(B))
• Heart of discriminative vs. generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding: James Allen; Speech and NLP: Jurafsky and Martin; Foundations of Statistical NLP: Manning and Schutze

Other References Statistical NLP Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied Disciplines

Philosophy: Semantics, meaning of "meaning", logic (syllogism)
Linguistics: Study of syntax, lexicon, lexical semantics, etc.
Probability and Statistics: Corpus linguistics, testing of hypotheses, system evaluation
Cognitive Science: Computational models of language processing, language acquisition
Psychology: Behavioristic insights into language processing, psychological models
Brain Science: Language processing areas in the brain
Physics: Information theory, entropy, random fields
Computer Sc. & Engg.: Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

inflection, declension, derivation, etc.), Rule-based vs. Statistical NLU comparison with POS tagging as case study, Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summary
• Both Linguistics and Computation are needed

• Linguistics is the eye, Computation the body

• Phenomenon → Formalization → Technique → Experimentation → Evaluation → Hypothesis Testing

• This has accorded to NLP the prestige it commands today

• Natural-Science-like approach

• Neither Theory Building nor Data Driven Pattern finding can be ignored
21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

"_" The_DT mechanisms_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, "_" said_VBD Mr._NNP Benton_NNP ._.

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_J batsman_N Aravinda_De_Silva_N is_F a_F man_N of_F few_J words_N was_F very_R much_R evident_J on_F Wednesday_N when_F the_F legendary_J batsman_N ,_F who_F has_V always_R let_V his_N bat_N talk_V ,_F struggled_V to_F answer_V a_F barrage_N of_F questions_N at_F a_F function_N to_F promote_V the_F cricket_N league_N in_F the_F city_N ._F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation: That_F/N/J ('that' can be a complementizer (can be put under 'F'),
demonstrative (can be put under 'J') or pronoun (can be put under 'N')); former_J; Sri_N/J Lanka_N/J (Sri Lanka together qualify the skipper); skipper_N/V ('skipper' can be a verb too); and_F; ace_J/N ('ace' can be both J and N: "Nadal served an ace"); batsman_N/J ('batsman' can be J as it qualifies Aravinda De Silva); Aravinda_N De_N Silva_N; is_F; a_F; man_N/V ('man' can be a verb too, as in 'man the boat'); of_F; few_J; words_N/V ('words' can be a verb too, as in 'he words his speeches
beautifully')

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

"That man is known by the company he keeps" is a proverb (Complementizer)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (1/2)

Best tag sequence
= T* = argmax P(T|W)
= argmax P(T) P(W|T)   (by Bayes' Theorem)

P(T) = P(t0 = ^, t1, t2, …, tn+1 = .)
     = P(t0) P(t1|t0) P(t2|t1,t0) P(t3|t2,t1,t0) … P(tn|tn-1,…,t0) P(tn+1|tn,…,t0)
     = P(t0) P(t1|t0) P(t2|t1) … P(tn|tn-1) P(tn+1|tn)
     = ∏_{i=0}^{N+1} P(ti|ti-1)      (Bigram Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (2/2)

P(W|T) = P(w0|t0–tn+1) P(w1|w0, t0–tn+1) P(w2|w1, w0, t0–tn+1) … P(wn|w0–wn-1, t0–tn+1) P(wn+1|w0–wn, t0–tn+1)

Assumption: a word is determined completely by its tag. This is inspired by speech recognition.

       = P(w0|t0) P(w1|t1) … P(wn+1|tn+1)
       = ∏_{i=0}^{n+1} P(wi|ti)
       = ∏_{i=1}^{n+1} P(wi|ti)      (Lexical Probability Assumption)
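A minimal sketch of scoring one (word, tag) sequence under these two assumptions; the bigram and lexical tables below are hypothetical toy values, not estimates from any corpus.

bigram = {("^", "N"): 0.6, ("N", "V"): 0.4, ("V", "."): 0.3}            # P(t_i | t_{i-1})
lexical = {("people", "N"): 1e-3, ("jump", "V"): 1e-4, (".", "."): 1.0}  # P(w_i | t_i)

def score(words, tags):
    p, prev = 1.0, "^"
    for w, t in zip(words, tags):
        p *= bigram.get((prev, t), 0.0) * lexical.get((w, t), 0.0)
        prev = t
    return p

print(score(["people", "jump", "."], ["N", "V", "."]))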

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R ._.

[Figure: tag lattice for the sentence, with lexical probabilities on word–tag arcs and bigram probabilities on tag–tag arcs]

This model is called the Generative model. Here words are observed from tags as states. This is similar to HMM.

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68
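A rough sketch of the last two evaluation steps listed above (per-POS accuracy and confusion matrix), computed from (gold, predicted) tag pairs; the six tag pairs below are made up purely for illustration.

from collections import Counter, defaultdict

gold = ["N", "V", "N", "J", "N", "V"]
pred = ["N", "V", "J", "J", "N", "N"]

confusion = defaultdict(Counter)              # confusion[gold_tag][predicted_tag]
for g, p in zip(gold, pred):
    confusion[g][p] += 1

for tag, row in sorted(confusion.items()):
    print(tag, "accuracy:", row[tag] / sum(row.values()), "row:", dict(row))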

[Figure: bar chart of per-POS accuracy under the bigram assumption, plotted over BNC C5 tags such as AJ0, AJ0-NN1, AJ0-VVG, AJC, AT0, AV0-AJ0, AVP-PRP, AVQ-CJS, CJS, CJS-PRP, CJT-DT0, CRD-PNI, DT0, DTQ, ITJ, NN1, NN1-NP0, NN1-VVG, NN2-VVZ, NP0-NN1, PNI, PNP, PNX, PRP, PRP-CJS, TO0, VBB, VBG, VBN, VDB, VDG, VDN, VHB, VHG, VHN, VM0, VVB-NN1, VVD-AJ0, VVG, VVG-NN1, VVN, VVN-VVD, VVZ-NN2; y-axis 0 to 1.2]

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

          AJ0  AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN  AJC  AJS   AT0   AV0  AV0-AJ0  AVP
AJ0      2899    20       32        1        3        3      0    0     18    35     27     1
AJ0-AV0    31    18        2        0        0        0      0    0      0     1     15     0
AJ0-NN1   161     0      116        0        0        0      0    0      0     0      1     0
AJ0-VVD     7     0        0        0        0        0      0    0      0     0      0     0
AJ0-VVG     8     0        0        0        2        0      0    0      1     0      0     0
AJ0-VVN     8     0        0        3        0        2      0    0      1     0      0     0
AJC         2     0        0        0        0        0     69    0      0    11      0     0
AJS         6     0        0        0        0        0      0   38      0     2      0     0
AT0       192     0        0        0        0        0      0    0   7000    13      0     0
AV0       120     8        2        0        0        0     15    2     24  2444     29    11
AV0-AJ0    10     7        0        0        0        0      0    0      0    16     33     0
AVP        24     0        0        0        0        0      0    0      1    11      0   737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1: # of Red = 30%, # of Green = 50%, # of Blue = 20%

Urn 2: # of Red = 10%, # of Green = 40%, # of Blue = 50%

Urn 3: # of Red = 60%, # of Green = 10%, # of Blue = 30%

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

Given:

Transition probability table:
        U1    U2    U3
U1     0.1   0.4   0.5
U2     0.6   0.2   0.2
U3     0.3   0.4   0.3

Emission probability table:
        R     G     B
U1     0.3   0.5   0.2
U2     0.1   0.4   0.5
U3     0.6   0.1   0.3

Observation: RRGGBRGR

State Sequence: ?? — not so easily computable

21 July 2014 Pushpak Bhattacharyya: Intro, POS 73
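As a small sketch, the two tables can be used directly to score one concrete (state sequence, observation) pair, e.g. P(U1 U2 U3, R G B); the initial probability of U1 is assumed to be 1/3 here only because the slide does not specify an initial distribution.

A = {"U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},
     "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
     "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3}}       # transition table
B = {"U1": {"R": 0.3, "G": 0.5, "B": 0.2},
     "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
     "U3": {"R": 0.6, "G": 0.1, "B": 0.3}}          # emission table

def joint(states, obs, start_prob=1/3):
    p = start_prob
    for i, (s, o) in enumerate(zip(states, obs)):
        if i > 0:
            p *= A[states[i - 1]][s]                # transition into the current urn
        p *= B[s][o]                                # emission of the observed colour
    return p

print(joint(["U1", "U2", "U3"], ["R", "G", "B"]))   # (1/3)*0.3 * 0.4*0.4 * 0.2*0.3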

Diagrammatic representation (1/2)

[Figure: state-transition diagram over U1, U2, U3, with the transition probabilities (0.1, 0.4, 0.5; 0.6, 0.2, 0.2; 0.3, 0.4, 0.3) on the arcs and the emission distributions (U1: R 0.3, G 0.5, B 0.2; U2: R 0.1, G 0.4, B 0.5; U3: R 0.6, G 0.1, B 0.3) attached to the states]

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (2/2)

[Figure: the same diagram with emission folded onto the arcs, i.e. each arc labelled with transition probability × emission probability for R, G and B (values such as R 0.02, G 0.08, B 0.10; R 0.24, G 0.04, B 0.12; R 0.06, G 0.24, B 0.30; R 0.15, G 0.25, B 0.10; R 0.18, G 0.03, B 0.09; R 0.03, G 0.05, B 0.02)]

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM

1. Given the observation sequence, find the possible state sequences — Viterbi
2. Given the observation sequence, find its probability — forward/backward algorithm
3. Given the observation sequence, find the HMM parameters — Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi

The "start" and "end" are important in a sequence. Subtrees get eliminated due to the Markov Assumption.

POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), . (end) [start & end states]

Illustration of Viterbi: Lexicon

people: N, V
laugh: N, V

Corpora for Training:
^ w11_t11 w12_t12 w13_t13 …………… w1k_1_t1k_1 .
^ w21_t21 w22_t22 w23_t23 …………… w2k_2_t2k_2 .
^ wn1_tn1 wn2_tn2 wn3_tn3 …………… wnk_n_tnk_n .

Inference

^

NN

NV

Transition table:
        ^     N     V     O     .
^      0    0.6   0.2   0.2    0
N      0    0.1   0.4   0.3   0.2
V      0    0.3   0.1   0.3   0.3
O      0    0.3   0.2   0.3   0.2
.      1     0     0     0     0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table (size = # POS tags in tagset × vocabulary size, where vocabulary size = # unique words in the corpus)

        ε     people     laugh      …
^       1       0           0       0
N       0    1×10⁻³      1×10⁻⁵
V       0    1×10⁻⁶      1×10⁻³
O       0       0        1×10⁻⁹
.       1       0           0       0

Inference: New Sentence

^ people laugh .

p( ^ N N . | ^ people laugh . )
= (0.6 × 0.1) × (0.1 × 1×10⁻³) × (0.2 × 1×10⁻⁵)

[Partial sequence graph: from ^ (via ε), tags N and V for "people", then N and V for "laugh", then .]

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = # states (tags + ^ + .) and |o| = length of the sentence (# words + ^ + .), then # sequences = |s|^(|o|−2)

But a large number of partial computations can be reused using Dynamic Programming
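A sketch of this blow-up for a toy tagset (the sentence length below is arbitrary):

from itertools import product

tags = ["N", "V", "O"]
sentence_length = 8                                        # words between ^ and .
print(len(list(product(tags, repeat=sentence_length))))    # 3**8 = 6561 = |s|**(|o|-2)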

Dynamic Programming

[Figure: trellis for "^ people laugh ." — from ^ (via ε) the first-level nodes N1, V2, O3 get scores 0.6 × 1.0 = 0.6, 0.2 and 0.2; expanding N1 gives
0.6 × 0.1 × 10⁻³ = 6 × 10⁻⁵
0.6 × 0.4 × 10⁻³ = 2.4 × 10⁻⁴
0.6 × 0.3 × 10⁻³ = 1.8 × 10⁻⁴
0.6 × 0.2 × 10⁻³ = 1.2 × 10⁻⁴ ]

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.

Now complexity reduces from |s|^|o| to |s|·|o|.

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm

[Figure: Viterbi tree for the sentence "^ People laugh ." — from ^ (via ε) the branches N, V, O carry (0.6), (0.2), (0.2); after "People" the surviving scores include (0.06×10⁻³), (0.24×10⁻³), (0.18×10⁻³) and (0.06×10⁻⁶), (0.02×10⁻⁶), (0.06×10⁻⁶), the remaining branches being (0)]

Claim: We do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B) = P(B|A) if P(A) = P(B)

P(A|B) means:
Causality: B causes A
Sequentiality: A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here: S = {U1, U2, U3}, V = {R, G, B}

For observation O = o1 … on
and state sequence Q = q1 … qn,

π is the initial state distribution: πi = P(q1 = Ui)

A = transition matrix:
        U1    U2    U3
U1     0.1   0.4   0.5
U2     0.6   0.2   0.2
U3     0.3   0.4   0.3

B = emission matrix:
        R     G     B
U1     0.3   0.5   0.2
U2     0.1   0.4   0.5
U3     0.6   0.1   0.3

92

Observations and states

Obs:    O1 O2 O3 O4 O5 O6 O7 O8
        R  R  G  G  B  R  G  R
State:  S1 S2 S3 S4 S5 S6 S7 S8

Si ∈ {U1, U2, U3}: a particular state
S: state sequence; O: observation sequence
S* = "best" possible state (urn) sequence
Goal: maximize P(S|O) by choosing the "best" S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

S* = argmax_S P(S|O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) P(B) = P(A) P(B|A)

P(A): Prior
P(B|A): Likelihood

argmax_S P(S|O) = argmax_S P(S) P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

P(S) = P(S1–8)
     = P(S1) P(S2|S1) P(S3|S2 S1) P(S4|S3 S2 S1) … P(S8|S7 … S1)

By Markov Assumption (k = 1):

P(S) = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) … P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

P(O|S) = P(O1|S1–8) P(O2|O1, S1–8) P(O3|O2 O1, S1–8) … P(O8|O1–7, S1–8)

Assumption: the ball drawn depends only on the Urn chosen:

P(O|S) = P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)

So  P(S|O) ∝ P(S) P(O|S)
          = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) … P(S8|S7) ·
            P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

[Figure: chain S0 → S1 → S2 → … → S8 → S9 with arc labels ε, R, R, G, G, B, R, G, R]

Obs:    O0 O1 O2 O3 O4 O5 O6 O7 O8
        ε  R  R  G  G  B  R  G  R
State:  S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

P(Ok|Sk) P(Sk+1|Sk) = P(Sk → Sk+1)_{Ok}

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

[Figure: two-state probabilistic FSM over S1 and S2, arcs labelled with (output, probability) pairs such as (a1, 0.3), (a2, 0.4), (a1, 0.2), (a2, 0.3), (a1, 0.1), (a2, 0.2), (a1, 0.3), (a2, 0.2)]

The question here is: "what is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree

[Figure: tree rooted at Start with initial scores S1 = 1.0, S2 = 0.0 (ε step); on reading a1, arc probabilities 0.1, 0.3, 0.2, 0.3 give 1·0.1 = 0.1 and 0.3 under S1 and 0.0, 0.0 under S2; on reading a2, arc probabilities 0.2, 0.4, 0.3, 0.2 give 0.1·0.2 = 0.02, 0.1·0.4 = 0.04, 0.3·0.3 = 0.09, 0.3·0.2 = 0.06]

Choose the winning sequence per state, per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd…

[Figure: continuing from the per-state winners 0.09 (S1) and 0.06 (S2); on the next a1, arc probabilities 0.1, 0.3, 0.2, 0.3 give 0.09·0.1 = 0.009, 0.027, 0.012 and 0.018; on the final a2, arc probabilities 0.3, 0.2, 0.4, 0.2 give the leaf scores 0.0081, 0.0054, 0.0048 and 0.0024]

The problem being addressed by this tree is
S* = argmax_S P(S | a1-a2-a1-a2, μ)

where a1-a2-a1-a2 is the output sequence and μ the model (or the machine).

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward):

S1 → S2 → S1 → S2 → S1, emitting a1, a2, a1, a2

Problem statement: Find the best possible sequence
S* = argmax_S P(S | O, μ)
where S = state sequence, O = output sequence, μ = machine (or model)

Machine (or model) μ = (S, A, T, S0), with
S0: start symbol, S: state collection, A: alphabet set, T: transitions

T is defined as P(Si --ak--> Sj)

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

Ending state ↓ / Latest symbol observed →

        ε      a1                               a2              a1                 a2
S1     1.0    (1.0×0.1, 0.0×0.2) = (0.1, 0.0)   (0.02, 0.09)    (0.009, 0.012)     (0.0024, 0.0081)
S2     0.0    (1.0×0.3, 0.0×0.3) = (0.3, 0.0)   (0.04, 0.06)    (0.027, 0.018)     (0.0048, 0.0054)

Note: every cell records the winning probability ending in that state.

The bold-faced value in each cell is the sequence probability ending in that state. Going backward from the final winner — the sequence ending in state S2 (indicated by the 2nd tuple) — we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm (following James Allen, Natural Language Understanding (2nd edition), Benjamin/Cummings (pub.), 1995)

Given:
1. The HMM, which means:
   a. Start State: S1
   b. Alphabet: A = {a1, a2, …, ap}
   c. Set of States: S = {S1, S2, …, Sn}
   d. Transition probability P(Si --ak--> Sj), which is equal to P(Sj, ak | Si)
2. The output string a1 a2 … aT

To find: The most likely sequence of states C1 C2 … CT which produces the given output sequence, i.e.,
C1 C2 … CT = argmax_C [ P(C | a1 a2 … aT) ]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For i = 2 to N do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For t = 2 to T do
    For i = 1 to N do
        SEQSCORE(i,t) = Max_{j=1..N} [ SEQSCORE(j, t−1) × P(Sj --a_k--> Si) ]
        BACKPTR(i,t) = index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = i that maximizes SEQSCORE(i,T)
For i from (T−1) down to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1×T
2. SEQSCORE can be 2×T

Homework: compare this with A* and Beam Search. Reason for this comparison: all of them work for finding and recovering a sequence.
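A minimal Python rendering of the SEQSCORE/BACKPTR procedure, using the toy ^/N/V/O/. transition table and the ε/people/laugh lexical table from the earlier slides; the end marker '.' is treated here as a word emitted by the '.' tag with probability 1 purely to keep the sketch short, so this is an illustration rather than a reference implementation.

states = ["^", "N", "V", "O", "."]
trans = {"^": {"N": 0.6, "V": 0.2, "O": 0.2},
         "N": {"N": 0.1, "V": 0.4, "O": 0.3, ".": 0.2},
         "V": {"N": 0.3, "V": 0.1, "O": 0.3, ".": 0.3},
         "O": {"N": 0.3, "V": 0.2, "O": 0.3, ".": 0.2}}
lex = {"N": {"people": 1e-3, "laugh": 1e-5},
       "V": {"people": 1e-6, "laugh": 1e-3},
       "O": {"laugh": 1e-9},
       ".": {".": 1.0}}

def viterbi(words):
    seqscore = [{"^": 1.0}]                           # initialization: start in ^
    backptr = [{}]
    for t, w in enumerate(words, start=1):
        seqscore.append({}); backptr.append({})
        for s in states:
            emit = lex.get(s, {}).get(w, 0.0)
            best_prev, best = None, 0.0
            for p, score in seqscore[t - 1].items():
                cand = score * trans.get(p, {}).get(s, 0.0) * emit
                if cand > best:
                    best_prev, best = p, cand
            if best_prev is not None:
                seqscore[t][s] = best
                backptr[t][s] = best_prev
    # sequence identification: walk BACKPTR from the best final state
    last = max(seqscore[-1], key=seqscore[-1].get)
    path = [last]
    for t in range(len(words), 1, -1):
        path.append(backptr[t][path[-1]])
    return list(reversed(path)), seqscore[-1][last]

print(viterbi(["people", "laugh", "."]))              # (['N', 'V', '.'], score)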

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

[Figure: first two levels of the Viterbi tree — from S0, the ε-transition reaches U1, U2, U3 with scores 0.5, 0.3, 0.2; after observing R, node scores 0.15, 0.03, 0.08 appear, arc values such as 0.06, 0.02, 0.02, 0.18, 0.24, 0.18 follow, and the next level carries the accumulated scores 0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036]

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of order > 1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   • Less probable because of grammatical mistake
3. P("The svn rises in the east")
   • Less probable because of lexical mistake
4. P("The sun rises in the west")
   • Less probable because of semantic mistake

Probabilities computed in the context of corpora
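A toy illustration of ranking such sentences with a count-based bigram language model; the miniature "corpus" below is invented for the example, and unseen events simply get probability zero (real models smooth).

from collections import Counter

corpus = "the sun rises in the east . the sun sets in the west .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def sentence_prob(sentence):
    words = sentence.lower().split()
    p = unigrams[words[0]] / len(corpus)
    for w1, w2 in zip(words, words[1:]):
        p *= bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    return p

print(sentence_prob("the sun rises in the east"))
print(sentence_prob("the sun rise in the east"))   # lower (zero here): ill-formed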

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0 o1 o2 o3 … om)?

P(O) = Σ_S P(O, S)      (Marginalization)

Consider the observation sequence O = o0 o1 o2 … om with state sequence S = S0 S1 S2 … Sm Sm+1,
where the Si's represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0 o1 o2 o3 … om)

P(O, S) = P(S) P(O|S)

P(S0 S1 … Sm+1) P(o0 o1 … om | S0 S1 … Sm+1)
  = [ P(S0) P(S1|S0) P(S2|S1) … P(Sm+1|Sm) ] · [ P(o0|S0) P(o1|S1) … P(om|Sm) ]
  = P(S0) [ P(o0|S0) P(S1|S0) ] [ P(o1|S1) P(S2|S1) ] … [ P(om|Sm) P(Sm+1|Sm) ]

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k, i)

Define F(k, i) = probability of being in state Si having seen o0 o1 o2 … ok

F(k, i) = P(o0 o1 o2 … ok, Si)

With m as the length of the observed sequence and N states,

P(observed sequence) = P(o0 o1 o2 … om)
                     = Σ_{p=0..N} P(o0 o1 o2 … om, Sp)
                     = Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd.)

F(k, q) = P(o0 o1 o2 … ok, Sq)
        = P(o0 o1 o2 … ok−1, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 … ok−1, Sp, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 … ok−1, Sp) · P(ok, Sq | o0 o1 o2 … ok−1, Sp)
        = Σ_{p=0..N} F(k−1, p) · P(ok, Sq | Sp)
        = Σ_{p=0..N} F(k−1, p) · P(Sp → Sq)_{ok}

Obs:    o0 o1 o2 o3 … ok ok+1 … om−1 om
State:  S0 S1 S2 S3 … Sp Sq  … Sm  Sfinal
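A short forward-probability sketch for the urn HMM using the earlier A and B tables; π is assumed uniform since the slides leave it unspecified, and the bookkeeping attaches each emission to the state reached rather than to the arc, which differs only notationally from the recurrence above.

A = {"U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},
     "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
     "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3}}
B = {"U1": {"R": 0.3, "G": 0.5, "B": 0.2},
     "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
     "U3": {"R": 0.6, "G": 0.1, "B": 0.3}}
pi = {"U1": 1/3, "U2": 1/3, "U3": 1/3}

def forward_probability(obs):
    F = {q: pi[q] * B[q][obs[0]] for q in A}                           # F(0, q)
    for o in obs[1:]:
        F = {q: sum(F[p] * A[p][q] for p in A) * B[q][o] for q in A}   # F(k, q)
    return sum(F.values())                                             # P(O) = sum_q F(m, q)

print(forward_probability(list("RRGGBRGR")))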

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(k, i)

Define B(k, i) = probability of seeing ok ok+1 ok+2 … om given that the state was Si

B(k, i) = P(ok ok+1 ok+2 … om | Si)

With m as the length of the whole observed sequence,

P(observed sequence) = P(o0 o1 o2 … om) = P(o0 o1 o2 … om | S0) = B(0, 0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd.)

B(k, p) = P(ok ok+1 ok+2 … om | Sp)
        = P(ok+1 ok+2 … om, ok | Sp)
        = Σ_{q=0..N} P(ok+1 ok+2 … om, ok, Sq | Sp)
        = Σ_{q=0..N} P(ok, Sq | Sp) · P(ok+1 ok+2 … om | ok, Sq, Sp)
        = Σ_{q=0..N} P(ok+1 ok+2 … om | Sq) · P(ok, Sq | Sp)
        = Σ_{q=0..N} B(k+1, q) · P(Sp → Sq)_{ok}

Obs:    o0 o1 o2 o3 … ok ok+1 … om−1 om
State:  S0 S1 S2 S3 … Sp Sq  … Sm  Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121
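A matching backward-probability sketch under the same assumptions (uniform start distribution, arc probability = emission at the source state × transition); for these tables it returns the same P(O) as the forward sketch above.

A = {"U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},
     "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
     "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3}}
B = {"U1": {"R": 0.3, "G": 0.5, "B": 0.2},
     "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
     "U3": {"R": 0.6, "G": 0.1, "B": 0.3}}

def backward_probability(obs):
    beta = {p: 1.0 for p in A}                                         # B(m+1, p) = 1
    for o in reversed(obs):
        beta = {p: sum(B[p][o] * A[p][q] * beta[q] for q in A) for p in A}
    return sum(beta[p] / 3 for p in A)                                 # uniform start distribution

print(backward_probability(list("RRGGBRGR")))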

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probabilities:
        ^     N     V     .
^      0    0.7   0.3    0
N      0    0.2   0.6   0.2
V      0    0.6   0.2   0.2
.      1     0     0     0

Lexical probabilities:
        ε    People   Laugh
^      1      0         0
N      0     0.8       0.2
V      0     0.1       0.9
.      1      0         0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the tree (ε People Laugh):

Path 1: ^ N N
P(Path1) = (1.0 × 0.7) × (0.8 × 0.2) × (0.2 × 0.2)

Path 2: ^ N V
P(Path2) = (1.0 × 0.7) × (0.8 × 0.6) × (0.9 × 0.2)

Path 3: ^ V N
P(Path3) = (1.0 × 0.3) × (0.1 × 0.6) × (0.2 × 0.2)

Path 4: ^ V V
P(Path4) = (1.0 × 0.3) × (0.1 × 0.2) × (0.9 × 0.2)

[Figure: lattice ^ →(ε) {N, V} for "People" → {N, V} for "Laugh" → .]
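A brute-force sketch of these four path scores in Python, using the transition and lexical tables of the previous slide; P(O) is their sum, which the trellis computes with fewer multiplications.

trans = {"^": {"N": 0.7, "V": 0.3},
         "N": {"N": 0.2, "V": 0.6, ".": 0.2},
         "V": {"N": 0.6, "V": 0.2, ".": 0.2}}
lex = {"^": {"ε": 1.0},
       "N": {"People": 0.8, "Laugh": 0.2},
       "V": {"People": 0.1, "Laugh": 0.9}}

words = ["ε", "People", "Laugh"]
total = 0.0
for tags in (["^", "N", "N"], ["^", "N", "V"], ["^", "V", "N"], ["^", "V", "V"]):
    p = 1.0
    for i, (w, t) in enumerate(zip(words, tags)):
        nxt = tags[i + 1] if i + 1 < len(tags) else "."
        p *= lex[t][w] * trans[t][nxt]     # (lexical × transition) per step
    print(tags, p)
    total += p
print("P(O) =", total)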

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis

F = accumulated F × output probability × transition probability

[Trellis for ^ People Laugh . with accumulated values:
F1 = 0.7 × 1.0
F2 = 0.3 × 1.0
F3 = F1 × (0.2 × 0.3) + F2 × (0.6 × 0.1)
F4 = F1 × (0.6 × 0.8) + F2 × (0.2 × 0.1)
F5 = F3 × (0.2 × 0.2) + F4 × (0.2 × 0.9) ]

Number of Multiplications

Tree: each path has 5 multiplications + 1 addition; there are 4 paths in the tree. Therefore, a total of 20 multiplications and 3 additions.

Trellis:
F1 → 1 multiplication
F2 → 1 multiplication
F3 = F1 × (1 mult) + F2 × (1 mult) = 4 multiplications + 1 addition
Similarly for F4 and F5: 4 multiplications and 1 addition each.

So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = # states and |O| = observation length, excluding ^ and .

Stage 1 of the trellis: |S| multiplications.
Each later stage: |S| nodes, each node needs computation over |S| arcs; each arc = 1 multiplication, and the accumulated F = 1 more multiplication. Total: 2|S|² multiplications per stage.

The same holds for each stage before reading '.'; at the final stage ('.'): 2|S| multiplications.

Therefore, total multiplications = |S| + 2|S|²(|O| − 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values multiplied by P(S_p → S_q).
3. Complexity = |S| + 2|S|²(|O| − 1) + 2|S|
             = 2|S|²|O| − 2|S|² + 3|S|
             = O(|S|² |O|)

i.e., linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward Probability:
   a) Derive the Backward Algorithm
   b) Compute its complexity
2. Express P(O) in terms of both Forward and Backward probability.

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131



Pragmatics Very hard problem Model user intention

Tourist (in a hurry checking out of the hotel motioning to the service boy) Boy go upstairs and see if my sandals are under the divan Do not be late I just have 15 minutes to catch the train

Boy (running upstairs and coming back panting) yes sir they are there

World knowledge WHY INDIA NEEDS A SECOND OCTOBER (ToI

21007)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

DiscourseProcessing of sequence of sentences Mother to John

John go to school It is open today Should you bunk Father will be very angry

Ambiguity of openbunk whatWhy will the father be angry

Complex chain of reasoning and application of world knowledge Ambiguity of father

father as parent or

father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected ndash today was the math test

He couldnrsquot control the class

Teacher shouldnrsquot have made him responsible

After all he is just a janitor

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (12)1 Teacher (angrily) did you miss the class yesterday

Student not much

2 A man coming back to his parked car sees the sticker Parking fine He goes and thanks the policeman for appreciating his parking skill

3 John I got a Jaguar car for my unemployed youngest sonJack Thats a great exchange

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (22)A teacher-student exchangeTeacher What do you think is the capital of Ethiopia

Student What do you think

Teacher (angrily) I do not think ltpausegt I know

Student I do not think I know ltno pausegt21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps Segmentation (Word Boundary Detection) Identify the word

Isolated Word Recognition Identify W given SS (speech signal)

^arg max ( | )

WW P W SS

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood called ldquophonological model ldquo intuitively more tractable

P(W) = prior probability called ldquolanguage modelrdquo

^arg m ax ( | )

arg m ax ( ) ( | )W

W

W P W SS

P W P SS W

W appears in the corpus( ) words in the corpus

P W

21 July 2014Pushpak Bhattacharyya Intro

POS 39

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound Text ambiguity whether vs weather right vs write bought vs bot

Text Sound ambiguity read (present tense) vs read (past tense) lead (verb) vs lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W)= P(t o m ae t o |Word is ldquotomatordquo) = Product of arc probabilities

t

s4

o m o

ae

t

aa

end

s1 s2 s3s5

s6 s7

10 10 10 1010

10

073

027

Word

Pronunciation Automaton

Tomato

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs Discrimnative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

bull Entity A and Entity BndashMatch(AB)

ndashTwo entities match iff their parts matchbull Match(Parts(A) Parts(B))

ndashTwo entities match iff their properties matchbull Match(Properties(A) Properties(B))

bull Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding James Allan Speech and NLP Jurafsky and Martin Foundations of Statistical NLP Manning and Schutze

Other References Statistical NLP Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied DisciplinesPhilosophy Semantics Meaning of ldquomeaningrdquo Logic

(syllogism)Linguistics Study of Syntax Lexicon Lexical Semantics etc

Probability and Statistics Corpus Linguistics Testing of Hypotheses System Evaluation

Cognitive Science Computational Models of Language Processing Language Acquisition

Psychology Behavioristic insights into Language Processing Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory Entropy Random Fields

Computer Sc amp Engg Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

infection declension derivation etc ) Rule based Vs Statistical NLU comparison with POS tagging as case study Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

ldquo_ldquo The_DT mechanisms_NNS that_WDTmake_VBP traditional_JJ hardware_NNare_VBP really_RB being_VBGobsoleted_VBN by_IN microprocessor-based_JJ machines_NNS _ rdquo_rdquo said_VBDMr_NNP Benton_NNP _

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

Rows: ending state.  Columns: latest symbol observed.

        ε     a1                                a2              a1                a2
S1     1.0   (1.0×0.1, 0.0×0.2) = (0.1, 0.0)    (0.02, 0.09)    (0.009, 0.012)    (0.0024, 0.0081)
S2     0.0   (1.0×0.3, 0.0×0.3) = (0.3, 0.0)    (0.04, 0.06)    (0.027, 0.018)    (0.0048, 0.0054)

Note: each cell records the scores of sequences arriving from S1 (first component) and from S2
(second component); the bold-faced (larger) value is the winning probability ending in that state.
The final winner is 0.0081, ending in S1 and arriving via S2 (indicated by its being the 2nd
component of the tuple); going backward from it we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105
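A few lines of Python reproduce the winning score per state in each column of this table. The arc probabilities are the ones read off the FSM figure and the tree above (a reconstruction, so treat the exact numbers as illustrative):

# Per-state winners for the output a1 a2 a1 a2 on the two-state probabilistic FSM.
# arc[(src, sym, dst)] = P(src --sym--> dst)
arc = {('S1', 'a1', 'S1'): 0.1, ('S1', 'a1', 'S2'): 0.3,
       ('S1', 'a2', 'S1'): 0.2, ('S1', 'a2', 'S2'): 0.4,
       ('S2', 'a1', 'S1'): 0.2, ('S2', 'a1', 'S2'): 0.3,
       ('S2', 'a2', 'S1'): 0.3, ('S2', 'a2', 'S2'): 0.2}

score = {'S1': 1.0, 'S2': 0.0}                              # the ε column
for sym in ['a1', 'a2', 'a1', 'a2']:
    score = {dst: max(score[src] * arc[(src, sym, dst)] for src in score)
             for dst in score}
    print(sym, score)
# final column: {'S1': 0.0081, 'S2': 0.0054} -- the overall winner ends in S1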

Algorithm (following James Allen, "Natural Language Understanding", 2nd edition, Benjamin/Cummings (pub.), 1995)

Given:
1. The HMM, which means:
   a. Start state: S1
   b. Alphabet: A = {a1, a2, ..., ap}
   c. Set of states: S = {S1, S2, ..., Sn}
   d. Transition probability P(Si --ak--> Sj), which is equal to P(Sj, ak | Si)
2. The output string: a1 a2 ... aT

To find: the most likely sequence of states C1 C2 ... CT which produces the given output
sequence, i.e.

C1 C2 ... CT = argmax_C [ P(C | a1 a2 ... aT) ]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contd…

Data structures:
1. An N×T array called SEQSCORE to maintain the winner sequence always
   (N = number of states, T = length of output sequence)
2. Another N×T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For i = 2 to N do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For t = 2 to T do
    For i = 1 to N do
        SEQSCORE(i,t) = Max over j = 1..N of [ SEQSCORE(j, t-1) * P(Sj --ak--> Si) ]
        BACKPTR(i,t)  = the index j that gives the Max above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = the i that maximizes SEQSCORE(i,T)
For i from (T-1) down to 1 do
    C(i) = BACKPTR[ C(i+1), (i+1) ]

Optimizations possible:
1. BACKPTR can be 1×T
2. SEQSCORE can be T×2

Homework: compare this with A* / beam search.
Reason for this comparison: both of them work for finding and recovering a sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 109
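Putting the three steps together, the following is a compact, runnable sketch of the SEQSCORE/BACKPTR formulation (Python). The combined arc probability P(Si --ak--> Sj) is passed in as a dictionary, and the two-state FSM from the preceding slides is used as test data; both representation choices are assumptions made here, not something the slides fix.

# Viterbi with SEQSCORE / BACKPTR, following steps 1-3 above.
# States are indexed 1..N, with state 1 as the start state.
def viterbi(n_states, output, arc):
    # arc[(j, a, i)] = P(Sj --a--> Si); returns the best state sequence C(1..T)
    T = len(output)
    SEQSCORE = [[0.0] * (T + 1) for _ in range(n_states + 1)]
    BACKPTR = [[0] * (T + 1) for _ in range(n_states + 1)]

    # 1. Initialization: all probability mass starts in S1
    SEQSCORE[1][1] = 1.0

    # 2. Iteration
    for t in range(2, T + 1):
        a = output[t - 1]                      # symbol read while moving into column t
        for i in range(1, n_states + 1):
            best_j, best = 0, 0.0
            for j in range(1, n_states + 1):
                cand = SEQSCORE[j][t - 1] * arc.get((j, a, i), 0.0)
                if cand > best:
                    best_j, best = j, cand
            SEQSCORE[i][t], BACKPTR[i][t] = best, best_j

    # 3. Sequence identification, working backward
    C = [0] * (T + 1)
    C[T] = max(range(1, n_states + 1), key=lambda i: SEQSCORE[i][T])
    for t in range(T - 1, 0, -1):
        C[t] = BACKPTR[C[t + 1]][t + 1]
    return C[1:]

# Two-state FSM of the earlier slides; 'e' stands for the initial ε column.
arc = {(1, 'a1', 1): 0.1, (1, 'a1', 2): 0.3, (1, 'a2', 1): 0.2, (1, 'a2', 2): 0.4,
       (2, 'a1', 1): 0.2, (2, 'a1', 2): 0.3, (2, 'a2', 1): 0.3, (2, 'a2', 2): 0.2}
print(viterbi(2, ['e', 'a1', 'a2', 'a1', 'a2'], arc))       # -> [1, 2, 1, 2, 1]

Keeping only the best incoming arc per (state, column) cell is exactly what reduces the exponential number of paths to the |S|·|O| table of the previous slides.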

Viterbi Algorithm for the Urn problem (first two symbols)

(The slide draws the first two levels of the Viterbi tree for the urn HMM: from the start state S0
an ε-arc reaches U1, U2 and U3 with probabilities 0.5, 0.3 and 0.2; reading the first symbol R,
each urn is expanded again, every arc multiplying an emission probability by a transition
probability, and only the best score per urn is retained for the next level.)

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of order > 1 (say 2)

The same theory works:

P(S) · P(O|S) = P(O0|S0) P(S1|S0) ·
                [P(O1|S1) P(S2|S1 S0)] · [P(O2|S2) P(S3|S2 S1)] · [P(O3|S3) P(S4|S3 S2)] ·
                [P(O4|S4) P(S5|S4 S3)] · [P(O5|S5) P(S6|S5 S4)] · [P(O6|S6) P(S7|S6 S5)] ·
                [P(O7|S7) P(S8|S7 S6)] · [P(O8|S8) P(S9|S8 S7)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e. P(S9|S8 S7) = 1.
O0 is the ε-transition.

Obs:    ε   R   R   G   G   B   R   G   R
        O0  O1  O2  O3  O4  O5  O6  O7  O8
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111
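In practice the higher-order case is usually reduced to the first-order one by treating a pair of consecutive states as a single composite state; a sketch of that standard reduction (not something shown on the slides) is:

# Reduce an order-2 transition model P(c | a, b) to order 1 over composite states (a, b).
def to_first_order(trans2):
    # trans2[(a, b, c)] = P(c | a, b); returns P((b, c) | (a, b)) as a dictionary
    trans1 = {}
    for (a, b, c), p in trans2.items():
        trans1[((a, b), (b, c))] = p
    return trans1

The Viterbi and forward computations then run unchanged, only over the larger composite state set.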

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence? The language modeling problem.

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   • Less probable because of grammatical mistake
3. P("The svn rises in the east")
   • Less probable because of lexical mistake
4. P("The sun rises in the west")
   • Less probable because of semantic mistake

Probabilities are computed in the context of corpora.

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of a language model

1. Detect well-formedness
   • Lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   • Given a piece of text, what language does it belong to?
     Good morning  → English
     Guten morgen  → German
     Bon jour      → French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0 o1 o2 o3 … om)?

P(O) = Σ_S P(O, S)          [Marginalization]

Consider the observation sequence paired with a state sequence:

O0 O1 O2 … Om
S0 S1 S2 … Sm Sm+1

where the Si's represent the state sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0 o1 o2 o3 … om)

P(O, S) = P(S) · P(O|S)

P(S0 S1 … Sm+1) · P(o0 o1 … om | S0 S1 … Sm+1)
  = [ P(S0) P(S1|S0) P(S2|S1) … P(Sm+1|Sm) ] · [ P(o0|S0) P(o1|S1) … P(om|Sm) ]
  = P(S0) · [P(o0|S0) P(S1|S0)] · [P(o1|S1) P(S2|S1)] · … · [P(om|Sm) P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k,i)

Define F(k,i) = probability of being in state Si having seen the observations o0 o1 o2 … ok.

F(k,i) = P(o0 o1 o2 … ok, Si)

With m as the length of the observed sequence and N states,

P(observed sequence) = P(o0 o1 o2 … om)
                     = Σ_{p=0..N} P(o0 o1 o2 … om, Sp)
                     = Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)

F(k, q) = P(o0 o1 o2 … ok, Sq)
        = P(o0 o1 o2 … ok-1, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 … ok-1, Sp, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 … ok-1, Sp) · P(ok, Sq | o0 o1 o2 … ok-1, Sp)
        = Σ_{p=0..N} F(k-1, p) · P(ok, Sq | Sp)
        = Σ_{p=0..N} F(k-1, p) · P(Sp → Sq) with observation ok

Obs:    O0  O1  O2  O3  …  Ok  Ok+1  …  Om-1  Om
State:  S0  S1  S2  S3  …  Sp  Sq    …  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119
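The recurrence translates directly into code. Below is a minimal sketch of the forward computation over the same arc-dictionary representation used in the Viterbi sketch earlier (the representation, and starting all the probability mass in a single start state, are assumptions for illustration):

# Forward algorithm: F(k, q) = sum_p F(k-1, p) * P(Sp --o_k--> Sq); P(O) = sum_q F(m, q).
def forward(states, start, output, arc):
    F = {q: (1.0 if q == start else 0.0) for q in states}
    for o in output:
        F = {q: sum(F[p] * arc.get((p, o, q), 0.0) for p in states) for q in states}
    return sum(F.values())

arc = {('S1', 'a1', 'S1'): 0.1, ('S1', 'a1', 'S2'): 0.3, ('S1', 'a2', 'S1'): 0.2, ('S1', 'a2', 'S2'): 0.4,
       ('S2', 'a1', 'S1'): 0.2, ('S2', 'a1', 'S2'): 0.3, ('S2', 'a2', 'S1'): 0.3, ('S2', 'a2', 'S2'): 0.2}
print(forward(['S1', 'S2'], 'S1', ['a1', 'a2', 'a1', 'a2'], arc))   # P(a1 a2 a1 a2)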

Backward probability B(k,i)

Define B(k,i) = probability of seeing the observations ok ok+1 ok+2 … om given that the state was Si.

B(k,i) = P(ok ok+1 ok+2 … om | Si)

With m as the length of the whole observed sequence,

P(observed sequence) = P(o0 o1 o2 … om) = P(o0 o1 o2 … om | S0) = B(0, 0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)

B(k, p) = P(ok ok+1 ok+2 … om | Sp)
        = Σ_{q=0..N} P(ok, Sq, ok+1 ok+2 … om | Sp)
        = Σ_{q=0..N} P(ok, Sq | Sp) · P(ok+1 ok+2 … om | ok, Sq, Sp)
        = Σ_{q=0..N} P(ok+1 ok+2 … om | Sq) · P(ok, Sq | Sp)
        = Σ_{q=0..N} B(k+1, q) · P(Sp → Sq) with observation ok

Obs:    O0  O1  O2  O3  …  Ok  Ok+1  …  Om-1  Om
State:  S0  S1  S2  S3  …  Sp  Sq    …  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121
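For symmetry, a matching sketch of the backward recurrence (same assumed arc-dictionary representation; deriving and analysing this algorithm is the exercise posed a few slides later):

# Backward algorithm: B(k, p) = sum_q B(k+1, q) * P(Sp --o_k--> Sq); P(O) = B(0, start).
def backward(states, start, output, arc):
    B = {p: 1.0 for p in states}                       # B(m+1, p) = 1 for every p
    for o in reversed(output):
        B = {p: sum(arc.get((p, o, q), 0.0) * B[q] for q in states) for p in states}
    return B[start]

# For the same states, output and arc dictionary, backward(...) returns the same P(O) as forward(...).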

How Forward Probability Works

Goal of forward probability: to find P(O), the probability of the observation sequence.

E.g.  ^ People laugh

(The slide shows the corresponding trellis: a start column ^, a column {N, V} for "People",
a column {N, V} for "laugh", and a final column for the end marker '.')

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probabilities P(next tag | tag):

        ^     N     V     .
  ^     0    0.7   0.3    0
  N     0    0.2   0.6   0.2
  V     0    0.6   0.2   0.2
  .     1     0     0     0

Lexical (emission) probabilities P(word | tag):

        ε   People  Laugh
  ^     1     0       0
  N     0    0.8     0.2
  V     0    0.1     0.9
  .     1     0       0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all state sequences S of  Π_k P(Ok | Sk) · P(Sk → Sk+1)

Computation in various paths of the tree

                ε       People    Laugh
Path 1:  ^      N        N
P(Path1) = (1.0 × 0.7) × (0.8 × 0.2) × (0.2 × 0.2)

Path 2:  ^      N        V
P(Path2) = (1.0 × 0.7) × (0.8 × 0.6) × (0.9 × 0.2)

Path 3:  ^      V        N
P(Path3) = (1.0 × 0.3) × (0.1 × 0.6) × (0.2 × 0.2)

Path 4:  ^      V        V
P(Path4) = (1.0 × 0.3) × (0.1 × 0.2) × (0.9 × 0.2)

(The slide draws the full tree rooted at ^, branching to N and V for "People" and again for "laugh".)

21 July 2014Pushpak Bhattacharyya Intro

POS 124
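A brute-force check of the four path probabilities and of their sum P(O), using only the transition and lexical tables above (the dictionary layout is an assumption for illustration):

# Enumerate all tag paths for "^ People laugh ." and sum their probabilities.
from itertools import product

trans = {'^': {'N': 0.7, 'V': 0.3},
         'N': {'N': 0.2, 'V': 0.6, '.': 0.2},
         'V': {'N': 0.6, 'V': 0.2, '.': 0.2}}
emit = {'^': {'ε': 1.0},
        'N': {'People': 0.8, 'laugh': 0.2},
        'V': {'People': 0.1, 'laugh': 0.9}}

words = ['ε', 'People', 'laugh']
total = 0.0
for tags in product(['N', 'V'], repeat=2):              # tags for "People" and "laugh"
    path = ('^',) + tags + ('.',)
    p = 1.0
    for k in range(3):                                   # emit from the source state, then move on
        p *= emit[path[k]][words[k]] * trans[path[k]][path[k + 1]]
    print(path, p)
    total += p
print('P(O) =', total)                                   # 0.06676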

Computations on the Trellis

F = accumulated F × output probability × transition probability

F1 = 0.7 × 1.0                          (at N and V, reached from ^ over the ε arc)
F2 = 0.3 × 1.0
F3 = F1 × (0.2 × 0.8) + F2 × (0.6 × 0.1)      (at N, after reading "People")
F4 = F1 × (0.6 × 0.8) + F2 × (0.2 × 0.1)      (at V, after reading "People")
F5 = F3 × (0.2 × 0.2) + F4 × (0.2 × 0.9)      (at '.', after reading "laugh")

21 July 2014Pushpak Bhattacharyya Intro

POS 125
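Evaluating these five quantities numerically (plain arithmetic on the tables above):

# Trellis computation for "^ People laugh .": P(O) = F5.
F1 = 0.7 * 1.0                                  # reach N from ^, emitting ε
F2 = 0.3 * 1.0                                  # reach V from ^
F3 = F1 * (0.2 * 0.8) + F2 * (0.6 * 0.1)        # reach N, emitting "People"  -> 0.13
F4 = F1 * (0.6 * 0.8) + F2 * (0.2 * 0.1)        # reach V, emitting "People"  -> 0.342
F5 = F3 * (0.2 * 0.2) + F4 * (0.2 * 0.9)        # reach ".", emitting "laugh" -> 0.06676
print(F1, F2, F3, F4, F5)

F5 agrees with the brute-force sum over the four tree paths, which is the point of the comparison on the next slide.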

Number of Multiplications

Tree: each path needs 5 multiplications; there are 4 paths in the tree, and summing them needs
3 additions. Therefore a total of 20 multiplications and 3 additions.

Trellis:
F1 → 1 multiplication
F2 → 1 multiplication
F3 = F1 × (…) + F2 × (…) → 4 multiplications + 1 addition
Similarly for F4 and F5: 4 multiplications and 1 addition each.

So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = the number of states, and |O| = the observation length (not counting the '^' and '.' markers).

Stage 1 of the trellis: |S| multiplications.
Each subsequent stage: |S| nodes, and each node needs a computation over |S| incoming arcs;
each arc = 1 multiplication, and accumulating into F = 1 more multiplication, so 2|S|²
multiplications per stage. The same holds for each stage before reading '.'; at the final
stage ('.') only 2|S| multiplications are needed.

Therefore, total multiplications = |S| + 2|S|²(|O| - 1) + 2|S|

(Check: for "^ People laugh .", |S| = 2 and |O| = 2, giving 2 + 8 + 4 = 14 multiplications,
the count obtained on the previous slide.)

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values, each multiplied by P(Sp → Sq).
3. Complexity = |S| + 2|S|²(|O| - 1) + 2|S|
              = 2|S|²|O| - 2|S|² + 3|S|
              = O(|S|² |O|)

i.e., linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward probability
   a) Derive the backward algorithm
   b) Compute its complexity
2. Express P(O) in terms of both the forward and the backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble, auto-completion of words (human vs machine)
Humour detection using wordnet (incongruity theory)
Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131



Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Perpectivising NLP Areas of AI and their inter-dependencies

Search

Vision

PlanningMachine Learning

Knowledge RepresentationLogic

Expert SystemsRoboticsNLP

21 July 2014Pushpak Bhattacharyya Intro

POS 8

NLP Two pictures

NLP

Vision Speech

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Statistics and Probability

+Knowledge Based

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 9

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 10

A famous sentence (12)

ldquoBuffalo buffaloes Buffalo buffaloes buffalo buffalo Buffalo buffaloes Buffalo buffaloes buffalo

21 July 2014Pushpak Bhattacharyya Intro

POS 11

A famous sentence (22)

ldquoBuffalo buffaloes Buffalo buffaloes buffalo buffalo Buffalo buffaloes Buffalo buffaloes buffalo

BuffaloAnimalCitybully

21 July 2014Pushpak Bhattacharyya Intro

POS 12

NLP multilayered multidimensional

Morphology

POS tagging

Chunking

Parsing

Semantics

Discourse and Coreference

IncreasedComplexity OfProcessing

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

Multilinguality Indian situation Major streams

Indo European Dravidian Sino Tibetan Austro-Asiatic

Some languages are ranked within 20 in the world in terms of the populations speaking them Hindi and Urdu 5th (~500

milion) Bangla 7th (~300 million) Marathi 14th (~70 million)

NLP architecture and stages of processing- ambiguity at every stage

Phonetics and phonology Morphology Lexical Analysis Syntactic Analysis Semantic Analysis Pragmatics Discourse

21 July 2014Pushpak Bhattacharyya Intro

POS 15

Phonetics processing of speech sound and associated challenges

Homophones bank (finance) vs bank (river bank) Near Homophones maatraa vs maatra (hin) Word Boundary

आजायग (aajaayenge) (aa jaayenge (will come) or aaj aayenge(will come today)

I got [ua]plate His research is in human languages

Disfluency ah um ahem etc

(near homophone trouble) The king of Abu Dhabi expired and there was national mourning for 7 days Some children were playing in the evening when a person chided them Do not play it is mourning time The children said No it is evening time and we will play

21 July 2014Pushpak Bhattacharyya Intro

POS 16

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 17

Morphology Word formation rules from root words Nouns Plural (boy-boys) Gender marking (czar-czarina) Verbs Tense (stretch-stretched) Aspect (eg perfective sit-had

sat) Modality (eg request khaanaa khaaiie) First crucial first step in NLP Languages rich in morphology eg Dravidian Hungarian

Turkish Languages poor in morphology Chinese English Languages with rich morphology have the advantage of easier

processing at higher stages of processing A task of interest to computer science Finite State Machines for

Word Morphology

21 July 2014Pushpak Bhattacharyya Intro

POS 18

Lexical Analysis

Dictionary and word properties

dognoun (lexical property)take-rsquosrsquo-in-plural (morph property)animate (semantic property)4-legged (-do-)carnivore (-do)

21 July 2014Pushpak Bhattacharyya Intro

POS 19

Lexical Disambiguationpart of Speech Disambiguation

Dog as a noun (animal) Dog as a verb (to pursue)

Sense Disambiguation Dog (as animal) Dog (as a very detestable person) The chair emphasised the need for adult education

Very common in day to day communicationsSatellite Channel Ad Watch what you want when you

want (two senses of watch)Ground breaking ceremonyresearch(ToI 14114) India eradicates polio says WHO

21 July 2014Pushpak Bhattacharyya Intro

POS 20

Technological developments bring in new terms additional meaningsnuances for existing terms

Justify as in justify the right margin (word processing context)

Xeroxed a new verb Digital Trace a new expression Communifaking pretending to talk on

mobile when you are actually not Discomgooglation anxietydiscomfort at

not being able to access internet Helicopter Parenting over parenting Obamagain Obama care modinomics

21 July 2014Pushpak Bhattacharyya Intro

POS 21

Ambiguity of Multiwords The grandfather kicked the bucket after suffering from cancer This job is a piece of cake Put the sweater on He is the dark horse of the match

Google Translations of above sentencesदादा कसर स पी ड़त होन क बाद बा ट लात मार इस काम क कक का एक टकड़ा हवटर पर रखोवह मच क अधर घोड़ा ह

21 July 2014Pushpak Bhattacharyya Intro

POS 22

Ambiguity of Named Entities

Bengali চ ল সরকার বািড়েত আেছEnglish Government is restless at home ()Chanchal Sarkar is at home

Amsterdam airport ldquoBaby Changing Roomrdquo

Hindi द नक दबग द नयाEnglish everyday bold worldActually name of a Hindi newspaper in Indore

High degree of overlap between NEs and MWEs

Treat differently - transliterate do not translate

21 July 2014Pushpak Bhattacharyya Intro

POS 23

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 24

StructureS

NPVP

V NP

Ilike mangoes

21 July 2014Pushpak Bhattacharyya Intro

POS25

Structural Ambiguity Scope

1The old men and women were taken to safe locations(old men and women) vs ((old men) and women)2 No smoking areas will allow Hookas inside

Preposition Phrase Attachment I saw the boy with a telescope

(who has the telescope) I saw the mountain with a telescope

(world knowledge mountain cannot be an instrument of seeing)

Very ubiquitous newspaper headline ldquo20 years later BMC pays father 20 lakhs for causing sonrsquos deathrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 26

Garden pathing

The only minus possibly was the need to face the audience more and more insightful question answer

The old man the boat

The horse raced past the garden fell

21 July 2014Pushpak Bhattacharyya Intro

POS 27

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 28

Semantic Analysis Representation in terms of

Predicate calculusSemantic NetsFramesConceptual Dependencies and Scripts

John gave a book to Mary Give action Agent John Object Book Recipient

Mary Challenge ambiguity in semantic role labeling

(Eng) Visiting aunts can be a nuisance (Hin) aapko mujhe mithaai khilaanii padegii

(ambiguous in Marathi and Bengali too not in Dravidian languages)

21 July 2014Pushpak Bhattacharyya Intro

POS 29

Coreference challenge

Binding of co-referring nouns and pronouns The monkey ate the banana because it

was hungry The monkey ate the banana because it

was ripe and sweet The monkey ate the banana because it

was lunch time

21 July 2014Pushpak Bhattacharyya Intro

POS 30

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 31

Pragmatics Very hard problem Model user intention

Tourist (in a hurry checking out of the hotel motioning to the service boy) Boy go upstairs and see if my sandals are under the divan Do not be late I just have 15 minutes to catch the train

Boy (running upstairs and coming back panting) yes sir they are there

World knowledge WHY INDIA NEEDS A SECOND OCTOBER (ToI

21007)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

DiscourseProcessing of sequence of sentences Mother to John

John go to school It is open today Should you bunk Father will be very angry

Ambiguity of openbunk whatWhy will the father be angry

Complex chain of reasoning and application of world knowledge Ambiguity of father

father as parent or

father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected ndash today was the math test

He couldnrsquot control the class

Teacher shouldnrsquot have made him responsible

After all he is just a janitor

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (12)1 Teacher (angrily) did you miss the class yesterday

Student not much

2 A man coming back to his parked car sees the sticker Parking fine He goes and thanks the policeman for appreciating his parking skill

3 John I got a Jaguar car for my unemployed youngest sonJack Thats a great exchange

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (22)A teacher-student exchangeTeacher What do you think is the capital of Ethiopia

Student What do you think

Teacher (angrily) I do not think ltpausegt I know

Student I do not think I know ltno pausegt21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps Segmentation (Word Boundary Detection) Identify the word

Isolated Word Recognition Identify W given SS (speech signal)

^arg max ( | )

WW P W SS

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood, called the "phonological model"; intuitively more tractable

P(W) = prior probability, called the "language model"

Ŵ = argmax_W P(W | SS)
  = argmax_W P(W) P(SS | W)

P(W) = #(W appears in the corpus) / #(words in the corpus)

21 July 2014Pushpak Bhattacharyya Intro

POS 39

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound Text ambiguity whether vs weather right vs write bought vs bot

Text Sound ambiguity read (present tense) vs read (past tense) lead (verb) vs lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

• Consists of: 1. Nucleus, 2. Onset, 3. Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W) = P(t o m ae t o | Word is "tomato") = product of arc probabilities

[Figure: pronunciation automaton for the word "tomato" with states s1–s7 and end; arcs t, o, m, ae/aa, t, o; all arc probabilities 1.0 except the branch ae (0.73) / aa (0.27)]

Word: Tomato

Pronunciation Automaton

21 July 2014Pushpak Bhattacharyya Intro

POS 44
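A minimal sketch of the computation on this slide, assuming the arc probabilities shown in the figure above (all 1.0 except the ae/aa branch): P(SS|W) for one path is just the product of the arcs traversed.

```python
# Sketch: P(SS | W = "tomato") along the "ae" path of the pronunciation automaton.
# All arcs carry probability 1.0 except the ae (0.73) / aa (0.27) branch point.
arc_probs_ae_path = [1.0, 1.0, 1.0, 0.73, 1.0, 1.0, 1.0]  # t o m ae t o end

def path_probability(arc_probs):
    """Multiply the probabilities of the arcs traversed."""
    p = 1.0
    for a in arc_probs:
        p *= a
    return p

print(path_probability(arc_probs_ae_path))  # 0.73
```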

Foundational question

Generative vs Discriminative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

• Entity A and Entity B
  – Match(A, B)
  – Two entities match iff their parts match: Match(Parts(A), Parts(B))
  – Two entities match iff their properties match: Match(Properties(A), Properties(B))

• Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books, Journals, Proceedings

Main Text(s):
  Natural Language Understanding: James Allen
  Speech and NLP: Jurafsky and Martin
  Foundations of Statistical NLP: Manning and Schutze

Other References:
  Statistical NLP: Charniak

Journals: Computational Linguistics, Natural Language Engineering, AI, AI Magazine, IEEE SMC

Conferences: ACL, EACL, COLING, MT Summit, EMNLP, IJCNLP, HLT, ICON, SIGIR, WWW, ICML, ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied Disciplines

Philosophy: Semantics, Meaning of "meaning", Logic (syllogism)

Linguistics: Study of Syntax, Lexicon, Lexical Semantics etc.

Probability and Statistics: Corpus Linguistics, Testing of Hypotheses, System Evaluation

Cognitive Science: Computational Models of Language Processing, Language Acquisition

Psychology: Behavioristic insights into Language Processing, Psychological Models

Brain Science: Language Processing Areas in Brain

Physics: Information Theory, Entropy, Random Fields

Computer Sc. & Engg.: Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

inflection, declension, derivation etc.), Rule based vs Statistical NLU comparison with POS tagging as case study, Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summary
• Both Linguistics and Computation are needed.

• Linguistics is the eye, Computation the body.

• The Phenomenon → Formalization → Technique → Experimentation → Evaluation → Hypothesis Testing cycle has accorded to NLP the prestige it commands today.

• Natural Science like approach.

• Neither Theory Building nor Data Driven Pattern finding can be ignored.
21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

"_" The_DT mechanisms_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, "_" said_VBD Mr._NNP Benton_NNP ._.

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_J batsman_N Aravinda_De_Silva_N is_F a_F man_N of_F few_J words_N was_F very_R much_R evident_J on_F Wednesday_N when_F the_F legendary_J batsman_N ,_F who_F has_V always_R let_V his_N bat_N talk_V ,_F struggled_V to_F answer_V a_F barrage_N of_F questions_N at_F a_F function_N to_F promote_V the_F cricket_N league_N in_F the_F city_N ._F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation: That_F/N/J ('that' can be a complementizer (can be put under 'F'), a demonstrative (can be put under 'J') or a pronoun (can be put under 'N')); former_J; Sri_N/J Lanka_N/J (Sri Lanka together qualify the skipper); skipper_N/V ('skipper' can be a verb too); and_F; ace_J/N ('ace' can be both J and N: "Nadal served an ace"); batsman_N/J ('batsman' can be J as it qualifies Aravinda De Silva); Aravinda_N De_N Silva_N; is_F; a_F; man_N/V ('man' can be a verb too, as in 'man the boat'); of_F; few_J; words_N/V ('words' can be a verb too, as in 'he words his speeches beautifully')

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of "That"

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation: was_F very_R much_R evident_J on_F Wednesday_N; when_F/N ('when' can be a relative pronoun (put under 'N') as in 'I know the time when he comes'); the_F legendary_J batsman_N; who_F/N; has_V always_R let_V his_N bat_N/V talk_V/N; struggle_V/N; answer_V/N; barrage_N/V; question_N/V; function_N/V; promote_V; cricket_N; league_N; city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (1/2)
Best tag sequence = T* = argmax P(T|W) = argmax P(T)P(W|T)   (by Bayes' Theorem)

P(T) = P(t0=^, t1, t2, …, tn+1=$)
     = P(t0) P(t1|t0) P(t2|t1,t0) P(t3|t2,t1,t0) … P(tn|tn-1,tn-2,…,t0) P(tn+1|tn,tn-1,…,t0)
     = P(t0) P(t1|t0) P(t2|t1) … P(tn|tn-1) P(tn+1|tn)
     = ∏_{i=0}^{n+1} P(ti|ti-1)   (Bigram Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (2/2)

P(W|T) = P(w0|t0…tn+1) P(w1|w0,t0…tn+1) P(w2|w1,w0,t0…tn+1) … P(wn|w0…wn-1,t0…tn+1) P(wn+1|w0…wn,t0…tn+1)

Assumption: A word is determined completely by its tag. This is inspired by speech recognition.

       = P(w0|t0) P(w1|t1) … P(wn+1|tn+1)
       = ∏_{i=0}^{n+1} P(wi|ti)
       = ∏_{i=1}^{n+1} P(wi|ti)   (Lexical Probability Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 66
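A minimal sketch of scoring one tag sequence under the bigram and lexical-probability assumptions above. The tiny probability tables are made-up illustrative numbers (not from any corpus), and '$' is used here as the end-of-sentence marker.

```python
# Sketch: score P(T) * P(W|T) for one tag sequence under the bigram and
# lexical probability assumptions. All numbers below are illustrative.
bigram = {('^', 'N'): 0.6, ('N', 'V'): 0.3, ('V', '$'): 0.3}              # P(t_i | t_{i-1})
lexical = {('people', 'N'): 1e-3, ('laugh', 'V'): 1e-3, ('$', '$'): 1.0}  # P(w_i | t_i)

def score(words, tags):
    """P(T)P(W|T) = prod_i P(t_i | t_{i-1}) * P(w_i | t_i), with t_0 = ^."""
    p, prev = 1.0, '^'
    for w, t in zip(words, tags):
        p *= bigram.get((prev, t), 0.0) * lexical.get((w, t), 0.0)
        prev = t
    return p

print(score(['people', 'laugh', '$'], ['N', 'V', '$']))
```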

Generative Model

^_^ People_N Jump_V High_R ._.

[Figure: tag lattice from ^ through N, V, A states, with lexical probabilities on the word emissions and bigram probabilities on the tag transitions]

This model is called a Generative model. Here words are observed from tags as states. This is similar to HMM.

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi – Unigram, Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68
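A small sketch of the last two evaluation steps listed above (per-POS accuracy and the confusion matrix), computed from gold vs. predicted tag lists; the two lists here are hypothetical placeholders for one evaluation fold.

```python
# Sketch: per-POS accuracy and a confusion matrix from gold vs. predicted tags.
from collections import Counter, defaultdict

gold = ['N', 'V', 'N', 'J', 'N', 'V']   # hypothetical gold tags for one fold
pred = ['N', 'V', 'J', 'J', 'N', 'N']   # hypothetical tagger output

confusion = defaultdict(Counter)        # confusion[gold_tag][predicted_tag] -> count
for g, p in zip(gold, pred):
    confusion[g][p] += 1

for tag, row in confusion.items():
    correct, total = row[tag], sum(row.values())
    print(tag, 'accuracy =', correct / total, 'row =', dict(row))
```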

[Figure: bar chart (y-axis 0 to 1.2) of per-POS accuracy over the tags AJ0, AJ0-NN1, AJ0-VVG, AJC, AT0, AV0-AJ0, AVP-PRP, AVQ-CJS, CJS, CJS-PRP, CJT-DT0, CRD-PNI, DT0, DTQ, ITJ, NN1, NN1-NP0, NN1-VVG, NN2-VVZ, NP0-NN1, PNI, PNP, PNX, PRP, PRP-CJS, TO0, VBB, VBG, VBN, VDB, VDG, VDN, VHB, VHG, VHN, VM0, VVB-NN1, VVD-AJ0, VVG, VVG-NN1, VVN, VVN-VVD, VVZ-NN2]

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

          AJ0  AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN  AJC  AJS   AT0   AV0  AV0-AJ0  AVP
AJ0      2899    20       32        1        3        3      0    0     18    35     27      1
AJ0-AV0    31    18        2        0        0        0      0    0      0     1     15      0
AJ0-NN1   161     0      116        0        0        0      0    0      0     0      1      0
AJ0-VVD     7     0        0        0        0        0      0    0      0     0      0      0
AJ0-VVG     8     0        0        0        2        0      0    0      1     0      0      0
AJ0-VVN     8     0        0        3        0        2      0    0      1     0      0      0
AJC         2     0        0        0        0        0     69    0      0    11      0      0
AJS         6     0        0        0        0        0      0   38      0     2      0      0
AT0       192     0        0        0        0        0      0    0   7000    13      0      0
AV0       120     8        2        0        0        0     15    2     24  2444     29     11
AV0-AJ0    10     7        0        0        0        0      0    0      0    16     33      0
AVP        24     0        0        0        0        0      0    0      1    11      0    737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1: # of Red = 30, # of Green = 50, # of Blue = 20
Urn 2: # of Red = 10, # of Green = 40, # of Blue = 50
Urn 3: # of Red = 60, # of Green = 10, # of Blue = 30

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd.)

Given:

Transition probability table:
        U1   U2   U3
U1     0.1  0.4  0.5
U2     0.6  0.2  0.2
U3     0.3  0.4  0.3

Emission probability table:
        R    G    B
U1     0.3  0.5  0.2
U2     0.1  0.4  0.5
U3     0.6  0.1  0.3

Observation: RRGGBRGR

State Sequence: ??? Not so easily computable.
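A minimal sketch of the same urn HMM written out as data structures, transcribed from the two tables above. The initial state distribution is not given on this slide, so it is left out of the sketch.

```python
# Sketch: the urn HMM's transition (A) and emission (B) tables as nested dicts,
# transcribed from the tables above.
states = ['U1', 'U2', 'U3']
symbols = ['R', 'G', 'B']

A = {'U1': {'U1': 0.1, 'U2': 0.4, 'U3': 0.5},
     'U2': {'U1': 0.6, 'U2': 0.2, 'U3': 0.2},
     'U3': {'U1': 0.3, 'U2': 0.4, 'U3': 0.3}}

B = {'U1': {'R': 0.3, 'G': 0.5, 'B': 0.2},
     'U2': {'R': 0.1, 'G': 0.4, 'B': 0.5},
     'U3': {'R': 0.6, 'G': 0.1, 'B': 0.3}}

observation = list('RRGGBRGR')
```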

Diagrammatic representation (12)

[Figure: three urns U1, U2, U3 with the transition probabilities (0.1, 0.4, 0.5 / 0.6, 0.2, 0.2 / 0.3, 0.4, 0.3) on the arcs and the emission probabilities for R, G, B written at each urn]

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

[Figure: the same three-urn network with each arc labelled by the joint transition × emission products for R, G, B (e.g. R: 0.02, G: 0.08, B: 0.10; R: 0.24, G: 0.04, B: 0.12; R: 0.06, G: 0.24, B: 0.30; …)]

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM
1. Given the observation sequence, find the possible state sequences — Viterbi
2. Given the observation sequence, find its probability — forward/backward algorithm
3. Given the observation sequence, find the HMM parameters — Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi: The "start" and "end" are important in a sequence. Subtrees get eliminated due to the Markov Assumption.

POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), $ (end) [start & end states]

Illustration of Viterbi: Lexicon
people: N, V
laugh: N, V

Corpora for Training:
^ w11_t11 w12_t12 w13_t13 … w1k_1_t1k_1 $
^ w21_t21 w22_t22 w23_t23 … w2k_2_t2k_2 $
…
^ wn1_tn1 wn2_tn2 wn3_tn3 … wnk_n_tnk_n $

Inference

^

NN

NV

Transition probability table:
      ^    N    V    O    $
^     0   0.6  0.2  0.2   0
N     0   0.1  0.4  0.3  0.2
V     0   0.3  0.1  0.3  0.3
O     0   0.3  0.2  0.3  0.2
$     1    0    0    0    0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

      ε      people    laugh    …
^     1        0          0
N     0     1×10⁻³     1×10⁻⁵
V     0     1×10⁻⁶     1×10⁻³
O     0        0        1×10⁻⁹
$     1        0          0

InferenceNew Sentence

^ people laugh

p(^ N N $ | ^ people laugh $) = (0.6 × 0.1) × (0.1 × 1×10⁻³) × (0.2 × 1×10⁻⁵)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find the maximum among them, we would run into an exponential number of computations.

If |s| = #states (tags + ^ + $) and |o| = length of sentence (#words + ^ + $), then #sequences = |s|^(|o|-2).

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming

[Figure: trellis rooted at ^; on ε it expands to N1, V2, O3, and on "people"/"laugh" each of these expands to N, V, O again (N4, N5, …)]

0.6 × 1.0 = 0.6
0.6 × 0.1 × 10⁻³ = 6 × 10⁻⁵
1. 0.6 × 0.4 × 10⁻³ = 2.4 × 10⁻⁴
2. 0.6 × 0.3 × 10⁻³ = 1.8 × 10⁻⁴
3. 0.6 × 0.2 × 10⁻³ = 1.2 × 10⁻⁴

No need to expand N4 and N5 because they will never be a part of the winning sequence.

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.

Now complexity reduces from |s|^|o| to |s|·|o|.

Here we followed the Markov assumption of order 1.

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86
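A compact sketch of this idea for "^ people laugh $", keeping only the best-scoring sequence ending in each state at each step. The tables follow the transition and lexical probability slides above; the sketch does not try to reproduce the hand-drawn tree's intermediate numbers exactly.

```python
# Sketch: Viterbi for "^ people laugh $"; keep the winner ending in each state.
trans = {'^': {'N': 0.6, 'V': 0.2, 'O': 0.2},
         'N': {'N': 0.1, 'V': 0.4, 'O': 0.3, '$': 0.2},
         'V': {'N': 0.3, 'V': 0.1, 'O': 0.3, '$': 0.3},
         'O': {'N': 0.3, 'V': 0.2, 'O': 0.3, '$': 0.2}}
lex = {'people': {'N': 1e-3, 'V': 1e-6, 'O': 0.0},
       'laugh':  {'N': 1e-5, 'V': 1e-3, 'O': 1e-9}}

def viterbi(words, tags=('N', 'V', 'O')):
    best = {'^': (1.0, ['^'])}                    # state -> (prob, best path so far)
    for w in words:
        new = {}
        for t in tags:
            cands = [(p * trans[s].get(t, 0.0) * lex[w][t], path + [t])
                     for s, (p, path) in best.items()]
            new[t] = max(cands)                   # winner ending in tag t
        best = new
    # close with the end state '$'
    return max((p * trans[s].get('$', 0.0), path + ['$'])
               for s, (p, path) in best.items())

print(viterbi(['people', 'laugh']))               # best sequence: ^ N V $
```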

Viterbi Algorithm

Tree for the sentence "^ People laugh":

[Figure: tree rooted at ^, branching to N, V, O at each of the ε, "People" and "laugh" levels]

Level ε:        (0.6) (0.2) (0.2)
Level "People": (0.06×10⁻³) (0.24×10⁻³) (0.18×10⁻³)
Level "laugh":  (0.06×10⁻⁶) (0.02×10⁻⁶) (0.06×10⁻⁶), and (0) (0) (0)

Claim: We do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (i.e. 'populate')

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(6×10⁻⁵) (6×10⁻⁸)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean?

P(A|B) = P(B|A) if P(A) = P(B)

P(A|B) means:
  Causality: B causes A
  Sequentiality: A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = {U1, U2, U3}, V = {R, G, B}

For observation O = o1 … on and state sequence Q = q1 … qn,
π_i is P(q1 = Ui)

A = transition matrix:
        U1   U2   U3
U1     0.1  0.4  0.5
U2     0.6  0.2  0.2
U3     0.3  0.4  0.3

B = emission matrix:
        R    G    B
U1     0.3  0.5  0.2
U2     0.1  0.4  0.5
U3     0.6  0.1  0.3

92

Observations and states

t:      1  2  3  4  5  6  7  8
OBS:    R  R  G  G  B  R  G  R
State:  S1 S2 S3 S4 S5 S6 S7 S8

Si = U1, U2 or U3: a particular state
S: state sequence; O: observation sequence
S* = "best" possible state (urn) sequence
Goal: Maximize P(S|O) by choosing the "best" S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

S* = argmax_S P(S|O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

P(S1-8 | O1-8) = P(S1|O) P(S2|S1,O) P(S3|S2,S1,O) … P(S8|S7,…,S1,O)

By Markov Assumption (a state depends only on the previous state):
P(S1-8 | O1-8) = P(S1|O) P(S2|S1,O) P(S3|S2,O) … P(S8|S7,O)

O1 O2 O3 O4 O5 O6 O7 O8
OBS:   R  R  G  G  B  R  G  R
State: S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) = P(A) P(B|A) / P(B)

P(A): Prior
P(B|A): Likelihood

argmax_S P(S|O) = argmax_S P(S) P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

P(S) = P(S1-8) = P(S1) P(S2|S1) P(S3|S2,S1) P(S4|S3,S2,S1) … P(S8|S7,…,S1)

By Markov Assumption (k=1):
P(S) = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) … P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

P(O|S) = P(O1|S1-8) P(O2|O1,S1-8) P(O3|O2,O1,S1-8) … P(O8|O1-7,S1-8)

Assumption: the ball drawn depends only on the Urn chosen:
P(O|S) = P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)

P(S|O) ∝ P(S) P(O|S)
       = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) … P(S8|S7) · P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

[Figure: state chain S0 → S1 → S2 → … → S8 → S9 with the observations ε, R, R, G, G, B, R, G, R on the arcs]

O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:   ε  R  R  G  G  B  R  G  R
State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

P(Ok|Sk) P(Sk+1|Sk) = P(Sk → Sk+1 with output Ok)

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

[Figure: two states S1 and S2 with arcs labelled (a1, 0.3), (a2, 0.4), (a1, 0.2), (a2, 0.3), (a1, 0.1), (a2, 0.2), (a1, 0.3), (a2, 0.2)]

The question here is: "what is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree

Start: S1 (1.0), S2 (0.0)

On ε → a1 (arc probabilities 0.1, 0.3, 0.2, 0.3):
  1.0×0.1 = 0.1   1.0×0.3 = 0.3   0.0   0.0

On a2 (arc probabilities 0.2, 0.4, 0.3, 0.2):
  0.1×0.2 = 0.02   0.1×0.4 = 0.04   0.3×0.3 = 0.09   0.3×0.2 = 0.06

Choose the winning sequence per state, per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd…

On a1 (arc probabilities 0.1, 0.3, 0.2, 0.3): node scores 0.09×0.1 = 0.009, 0.027, 0.012, 0.018

On a2: S1 (×0.3): 0.0081;  S2 (×0.2): 0.0054;  S2 (×0.4): 0.0048;  S1 (×0.2): 0.0024

The problem being addressed by this tree is S* = argmax_S P(S | a1-a2-a1-a2, μ)
a1-a2-a1-a2 is the output sequence and μ the model, or the machine.

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 → S2 → S1 → S2 → S1, on outputs a1, a2, a1, a2

Problem statement: Find the best possible sequence
S* = argmax_S P(S | O, μ)
where S = state sequence, O = output sequence, μ = machine or model

Machine or Model = (S, A, T, S0): state collection, alphabet set, transitions, start symbol

T is defined as P(Si --ak--> Sj)

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

Ending state \ Latest symbol observed:
        ε      a1                                a2              a1                 a2
S1     1.0    (1.0×0.1, 0.0×0.2) = (0.1, 0.0)   (0.02, 0.09)    (0.009, 0.012)     (0.0024, 0.0081)
S2     0.0    (1.0×0.3, 0.0×0.3) = (0.3, 0.0)   (0.04, 0.06)    (0.027, 0.018)     (0.0048, 0.0054)

Note: Every cell records the winning probability ending in that state. The bold-faced value in each cell is the sequence probability ending in that state; the final winner is in the last column. Going backward from the final winner sequence, which ends in state S2 (indicated by the 2nd tuple), we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm (following James Allen, Natural Language Understanding (2nd edition), Benjamin Cummings (pub.), 1995)

Given:
1. The HMM, which means:
   a. Start State: S1
   b. Alphabet: A = {a1, a2, …, ap}
   c. Set of States: S = {S1, S2, …, Sn}
   d. Transition probability P(Si --ak--> Sj), which is equal to P(Sj, ak | Si)
2. The output string: a1 a2 … aT

To find: The most likely sequence of states C1 C2 … CT which produces the given output sequence, i.e.,
C1 C2 … CT = argmax_C [ P(C1 C2 … CT | a1 a2 … aT) ]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For (i = 2 to N) do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For (t = 2 to T) do
    For (i = 1 to N) do
        SEQSCORE(i,t) = Max(j=1..N) [ SEQSCORE(j,(t-1)) * P(Sj --ak--> Si) ]
        BACKPTR(i,t) = index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = i that maximizes SEQSCORE(i,T)
For i from (T-1) down to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1×T
2. SEQSCORE can be T×2

Homework: Compare this with A*, Beam Search. Reason for this comparison: both of them work for finding and recovering a sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 109
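A minimal sketch of the three steps above as running code, using 0-based indexing instead of the slide's 1-based notation; the `trans_out` structure (state → output symbol → next state → probability) is my own encoding of P(Si --ak--> Sj).

```python
# Sketch: SEQSCORE / BACKPTR Viterbi, transcribed from the pseudocode above.
def viterbi_tables(output, n_states, trans_out, start=0):
    """output[0] plays the role of the initial (epsilon) symbol; start = index of S1."""
    T = len(output)
    SEQSCORE = [[0.0] * T for _ in range(n_states)]
    BACKPTR  = [[0]   * T for _ in range(n_states)]
    SEQSCORE[start][0] = 1.0                              # Initialization
    for t in range(1, T):                                 # Iteration
        a = output[t]
        for i in range(n_states):
            scores = [SEQSCORE[j][t - 1] * trans_out[j].get(a, {}).get(i, 0.0)
                      for j in range(n_states)]
            SEQSCORE[i][t] = max(scores)
            BACKPTR[i][t] = scores.index(SEQSCORE[i][t])
    # Sequence identification (working backward through BACKPTR)
    C = [0] * T
    C[T - 1] = max(range(n_states), key=lambda i: SEQSCORE[i][T - 1])
    for t in range(T - 2, -1, -1):
        C[t] = BACKPTR[C[t + 1]][t + 1]
    return C
```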

Viterbi Algorithm for the Urn problem (first two symbols)

[Figure: trellis from S0 on ε to U1, U2, U3 (arc probabilities 0.5, 0.3, 0.2), then on R to U1, U2, U3 under each; node scores 0.03, 0.08, 0.15 after the first symbol, arc contributions 0.06, 0.02, 0.02, 0.18, 0.24, 0.18, and node scores 0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036 at the next level]

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of order > 1 (say 2)

Same theory works:
P(S)P(O|S) = P(O0|S0) P(S1|S0) ·
[P(O1|S1) P(S2|S1,S0)][P(O2|S2) P(S3|S2,S1)] [P(O3|S3) P(S4|S3,S2)] [P(O4|S4) P(S5|S4,S3)] [P(O5|S5) P(S6|S5,S4)] [P(O6|S6) P(S7|S6,S5)] [P(O7|S7) P(S8|S7,S6)] [P(O8|S8) P(S9|S8,S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence? The language modeling problem.

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   • Less probable because of grammatical mistake
3. P("The svn rises in the east")
   • Less probable because of lexical mistake
4. P("The sun rises in the west")
   • Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model
1. Detect well-formedness
   • Lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   • Given a piece of text, what language does it belong to?
     Good morning – English
     Guten Morgen – German
     Bon jour – French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0, o1, o2, o3, …, om)?

P(O) = Σ_S P(O, S)    (Marginalization)

Consider the observation sequence O = o0 o1 o2 … om, with state sequence S = S0 S1 S2 … Sm Sm+1, where the Si represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0, o1, o2, o3, …, om)

P(O, S) = P(S) P(O|S)

P(S0 S1 S2 … Sm) P(o0 o1 o2 … om | S0 S1 S2 … Sm)
  = [P(S0) P(S1|S0) P(S2|S1) … P(Sm|Sm-1)] · [P(o0|S0) P(o1|S1) … P(om|Sm)]
  = [P(o0|S0) P(S1|S0)] [P(o1|S1) P(S2|S1)] … [P(om|Sm) P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k,i)

Define F(k,i) = probability of being in state Si having seen o0 o1 o2 … ok
F(k,i) = P(o0, o1, o2, …, ok, Si)

With m as the length of the observed sequence and N states:
P(observed sequence) = P(o0, o1, o2, …, om)
                     = Σ_{p=0..N} P(o0, o1, o2, …, om, Sp)
                     = Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd.)
F(k, q) = P(o0, o1, o2, …, ok, Sq)
        = P(o0, o1, o2, …, ok-1, ok, Sq)
        = Σ_{p=0..N} P(o0, o1, o2, …, ok-1, Sp, ok, Sq)
        = Σ_{p=0..N} P(o0, o1, o2, …, ok-1, Sp) · P(ok, Sq | o0, o1, o2, …, ok-1, Sp)
        = Σ_{p=0..N} F(k-1, p) · P(ok, Sq | Sp)
        = Σ_{p=0..N} F(k-1, p) · P(Sp → Sq with output ok)

O0 O1 O2 O3 … Ok Ok+1 … Om-1 Om
S0 S1 S2 S3 … Sp Sq  … Sm  Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119
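A minimal sketch of this recurrence, accumulating F stage by stage over the trellis. It reuses the A, B dictionaries from the urn sketch earlier; `init`, the initial state distribution, is an assumption since it is not specified on these slides.

```python
# Sketch: forward probabilities
#   F(k, q) = sum_p F(k-1, p) * P(S_p -> S_q) * P(o_k | S_q)
# with F(0, q) = init(q) * P(o_0 | S_q); returns P(o_0 ... o_m).
def forward(obs, states, A, B, init):
    F = [{q: init[q] * B[q][obs[0]] for q in states}]      # F(0, q)
    for k in range(1, len(obs)):
        F.append({q: sum(F[k - 1][p] * A[p][q] for p in states) * B[q][obs[k]]
                  for q in states})
    return sum(F[-1].values())

# e.g. forward(list('RRG'), states, A, B, {'U1': 1/3, 'U2': 1/3, 'U3': 1/3})
```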

Backward probability B(k,i)

Define B(k,i) = probability of seeing ok, ok+1, ok+2, …, om given that the state was Si
B(k,i) = P(ok, ok+1, ok+2, …, om | Si)

With m as the length of the whole observed sequence:
P(observed sequence) = P(o0, o1, o2, …, om)
                     = P(o0, o1, o2, …, om | S0) = B(0,0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd.)
B(k, p) = P(ok, ok+1, ok+2, …, om | Sp)
        = P(ok+1, ok+2, …, om, ok | Sp)
        = Σ_{q=0..N} P(ok+1, ok+2, …, om, ok, Sq | Sp)
        = Σ_{q=0..N} P(ok, Sq | Sp) · P(ok+1, ok+2, …, om | ok, Sq, Sp)
        = Σ_{q=0..N} P(ok+1, ok+2, …, om | Sq) · P(ok, Sq | Sp)
        = Σ_{q=0..N} B(k+1, q) · P(Sp → Sq with output ok)

O0 O1 O2 O3 … Ok Ok+1 … Om-1 Om
S0 S1 S2 S3 … Sp Sq  … Sm  Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability: To find P(O) [the probability of the Observation Sequence]

E.g. ^ people laugh

[Figure: trellis with start state ^ and two columns of N, V states, one for "people" and one for "laugh"]

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition:
      ^    N    V    $
^     0   0.7  0.3   0
N     0   0.2  0.6  0.2
V     0   0.6  0.2  0.2
$     1    0    0    0

Lexical:
      ε   People  Laugh
^     1     0       0
N     0    0.8     0.2
V     0    0.1     0.9
$     1     0       0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over state sequences of Π_j P(Oj | Sj) · P(Sj → Sj+1)

Computation in various paths of the Tree (ε, People, Laugh):

Path 1: ^ N N;  P(Path1) = (1.0×0.7)×(0.8×0.2)×(0.2×0.2)
Path 2: ^ N V;  P(Path2) = (1.0×0.7)×(0.8×0.6)×(0.9×0.2)
Path 3: ^ V N;  P(Path3) = (1.0×0.3)×(0.1×0.6)×(0.2×0.2)
Path 4: ^ V V;  P(Path4) = (1.0×0.3)×(0.1×0.2)×(0.9×0.2)

[Figure: tree rooted at ^, branching to N and V on "People" and again to N and V on "Laugh"]

21 July 2014Pushpak Bhattacharyya Intro

POS 124
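A minimal sketch of this brute-force computation: it enumerates the four N/V state sequences for "^ people laugh $" and sums exactly the path products listed above (the trellis computation that follows reuses the shared partial products instead).

```python
# Sketch: brute-force P(O) for "^ people laugh $" by summing over all paths.
from itertools import product

trans = {'^': {'N': 0.7, 'V': 0.3},
         'N': {'N': 0.2, 'V': 0.6, '$': 0.2},
         'V': {'N': 0.6, 'V': 0.2, '$': 0.2}}
lex = {'N': {'people': 0.8, 'laugh': 0.2},
       'V': {'people': 0.1, 'laugh': 0.9}}
words = ['people', 'laugh']

total = 0.0
for path in product('NV', repeat=len(words)):        # N N, N V, V N, V V
    p, prev = 1.0, '^'
    for w, t in zip(words, path):
        p *= trans[prev][t] * lex[t][w]              # transition * emission
        prev = t
    p *= trans[prev]['$']                            # close with the end state
    total += p
    print(path, p)
print('P(O) =', total)
```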

Computations on the Trellis

F = accumulated F × output probability × transition probability

F = 0.7 × 1.0
F = 0.3 × 1.0
F = F × (0.2 × 0.3) + F × (0.6 × 0.1)
F = F × (0.6 × 0.8) + F × (0.2 × 0.1)
F = F × (0.2 × 0.2) + F × (0.2 × 0.9)

[Figure: trellis ^ → {N, V} ("People") → {N, V} ("Laugh"), with F accumulated at each node]

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of Multiplications

Tree: Each path has 5 multiplications + 1 addition. There are 4 paths in the tree. Therefore a total of 20 multiplications and 3 additions.

Trellis: F1 → 1 multiplication; F2 → 1 multiplication; F3 = F1 × (1 mult) + F2 × (1 mult) = 4 multiplications + 1 addition. Similarly for F4 and F5: 4 multiplications and 1 addition each.

So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = #States and |O| = observation length (excluding '^' and '$').
Stage 1 of the Trellis: |S| multiplications.
Stage 2 of the Trellis: |S| nodes; each node needs computation over |S| arcs. Each arc = 1 multiplication; accumulated F = 1 more multiplication. Total: 2|S|² multiplications.
Same for each stage before reading '$'. At the final stage ('$'): 2|S| multiplications.
Therefore total multiplications = |S| + 2|S|²(|O| - 1) + 2|S|.

21 July 2014Pushpak Bhattacharyya Intro

POS 127
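A one-line sanity check of this count (a sketch; the function and parameter names are my own) reproduces the 14 multiplications of the two-word example above:

```python
# Sketch: total multiplications for the forward trellis, |S| + 2|S|^2 (|O| - 1) + 2|S|.
def forward_multiplications(n_states, obs_len):
    return n_states + 2 * n_states**2 * (obs_len - 1) + 2 * n_states

print(forward_multiplications(2, 2))  # 14, matching the "^ people laugh" example
```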

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values multiplied by P(Si → $).
3. Complexity = |S| + 2|S|²(|O| - 1) + 2|S|
              = 2|S|²|O| - 2|S|² + 3|S|
              = O(|S|²|O|)
i.e. linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward Probability
   a) Derive the Backward Algorithm
   b) Compute its complexity

2. Express P(O) in terms of both Forward and Backward probabilities.

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131


Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 28

Semantic Analysis Representation in terms of

Predicate calculusSemantic NetsFramesConceptual Dependencies and Scripts

John gave a book to Mary Give action Agent John Object Book Recipient

Mary Challenge ambiguity in semantic role labeling

(Eng) Visiting aunts can be a nuisance (Hin) aapko mujhe mithaai khilaanii padegii

(ambiguous in Marathi and Bengali too not in Dravidian languages)

21 July 2014Pushpak Bhattacharyya Intro

POS 29

Coreference challenge

Binding of co-referring nouns and pronouns The monkey ate the banana because it

was hungry The monkey ate the banana because it

was ripe and sweet The monkey ate the banana because it

was lunch time

21 July 2014Pushpak Bhattacharyya Intro

POS 30

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 31

Pragmatics Very hard problem Model user intention

Tourist (in a hurry checking out of the hotel motioning to the service boy) Boy go upstairs and see if my sandals are under the divan Do not be late I just have 15 minutes to catch the train

Boy (running upstairs and coming back panting) yes sir they are there

World knowledge WHY INDIA NEEDS A SECOND OCTOBER (ToI

21007)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

DiscourseProcessing of sequence of sentences Mother to John

John go to school It is open today Should you bunk Father will be very angry

Ambiguity of openbunk whatWhy will the father be angry

Complex chain of reasoning and application of world knowledge Ambiguity of father

father as parent or

father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected ndash today was the math test

He couldnrsquot control the class

Teacher shouldnrsquot have made him responsible

After all he is just a janitor

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (12)1 Teacher (angrily) did you miss the class yesterday

Student not much

2 A man coming back to his parked car sees the sticker Parking fine He goes and thanks the policeman for appreciating his parking skill

3 John I got a Jaguar car for my unemployed youngest sonJack Thats a great exchange

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (22)A teacher-student exchangeTeacher What do you think is the capital of Ethiopia

Student What do you think

Teacher (angrily) I do not think ltpausegt I know

Student I do not think I know ltno pausegt21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps Segmentation (Word Boundary Detection) Identify the word

Isolated Word Recognition Identify W given SS (speech signal)

^arg max ( | )

WW P W SS

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood called ldquophonological model ldquo intuitively more tractable

P(W) = prior probability called ldquolanguage modelrdquo

^arg m ax ( | )

arg m ax ( ) ( | )W

W

W P W SS

P W P SS W

W appears in the corpus( ) words in the corpus

P W

21 July 2014Pushpak Bhattacharyya Intro

POS 39

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound Text ambiguity whether vs weather right vs write bought vs bot

Text Sound ambiguity read (present tense) vs read (past tense) lead (verb) vs lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W)= P(t o m ae t o |Word is ldquotomatordquo) = Product of arc probabilities

t

s4

o m o

ae

t

aa

end

s1 s2 s3s5

s6 s7

10 10 10 1010

10

073

027

Word

Pronunciation Automaton

Tomato

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs Discrimnative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

bull Entity A and Entity BndashMatch(AB)

ndashTwo entities match iff their parts matchbull Match(Parts(A) Parts(B))

ndashTwo entities match iff their properties matchbull Match(Properties(A) Properties(B))

bull Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding James Allan Speech and NLP Jurafsky and Martin Foundations of Statistical NLP Manning and Schutze

Other References Statistical NLP Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied DisciplinesPhilosophy Semantics Meaning of ldquomeaningrdquo Logic

(syllogism)Linguistics Study of Syntax Lexicon Lexical Semantics etc

Probability and Statistics Corpus Linguistics Testing of Hypotheses System Evaluation

Cognitive Science Computational Models of Language Processing Language Acquisition

Psychology Behavioristic insights into Language Processing Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory Entropy Random Fields

Computer Sc amp Engg Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

infection declension derivation etc ) Rule based Vs Statistical NLU comparison with POS tagging as case study Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

ldquo_ldquo The_DT mechanisms_NNS that_WDTmake_VBP traditional_JJ hardware_NNare_VBP really_RB being_VBGobsoleted_VBN by_IN microprocessor-based_JJ machines_NNS _ rdquo_rdquo said_VBDMr_NNP Benton_NNP _

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 --a1--> S2 --a2--> S1 --a1--> S2 --a2--> S1

Problem statement: Find the best possible sequence
    S* = argmax_S P(S | O, μ)
where
    S = State Sequence, O = Output Sequence, μ = Machine or Model

Machine or Model = (S, A, T, S0), where
    S = State collection, A = Alphabet set, T = Transitions, S0 = Start symbol

T is defined as P(Si --ak--> Sj)

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

Ending state (rows) vs. latest symbol observed (columns):

        ε      a1                                 a2              a1                a2
S1     1.0    (1.0*0.1, 0.0*0.2) = (0.1, 0.0)    (0.02, 0.09)    (0.009, 0.012)    (0.0024, 0.0081)
S2     0.0    (1.0*0.3, 0.0*0.3) = (0.3, 0.0)    (0.04, 0.06)    (0.027, 0.018)    (0.0048, 0.0054)

Note: Every cell records the winning probability of sequences ending in that state;
the two values in a cell are the candidates reached from S1 and from S2 respectively,
and the larger of the two is the winner.

Final winner: 0.0081, ending in state S1. It is the 2nd element of its tuple, i.e. it
was reached from S2; going backward this way we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm (following James Allen, Natural Language Understanding (2nd edition), Benjamin Cummings (pub.), 1995)

Given:
1. The HMM, which means:
   a. Start State: S1
   b. Alphabet: A = {a1, a2, ..., ap}
   c. Set of States: S = {S1, S2, ..., Sn}
   d. Transition probability P(Si --ak--> Sj), which is equal to P(Sj, ak | Si)
2. The output string: a1 a2 ... aT

To find: The most likely sequence of states C1 C2 ... CT which produces the given
output sequence, i.e.,
   C1 C2 ... CT = argmax_C [P(C | a1 a2 ... aT)]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contd...

Data Structure:
1. An N*T array called SEQSCORE to maintain the winner sequence always
   (N = #states, T = length of o/p sequence)
2. Another N*T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For (i = 2 to N) do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For (t = 2 to T) do
    For (i = 1 to N) do
        SEQSCORE(i,t) = Max(j=1..N) [SEQSCORE(j, (t-1)) * P(Sj --ak--> Si)]
        BACKPTR(i,t) = index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = i that maximizes SEQSCORE(i,T)
For i from (T-1) to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1*T
2. SEQSCORE can be T*2

Homework: Compare this with A*, Beam Search.
Reason for this comparison: both of them work for finding and recovering a sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 109
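The three steps above can be written down compactly in Python. The sketch below keeps the SEQSCORE/BACKPTR idea as per-step dictionaries; the transition structure T[si][ak][sj] = P(Si --ak--> Sj), the start state S1 and the output string a1 a2 a1 a2 follow the slides, and the numeric arc probabilities are the ones reconstructed from the tree-development figures, so the expected result is the path S1-S2-S1-S2-S1 with probability 0.0081.

# Viterbi with SEQSCORE / BACKPTR for a machine T[si][ak][sj] = P(Si --ak--> Sj).
T = {'S1': {'a1': {'S1': 0.1, 'S2': 0.3}, 'a2': {'S1': 0.2, 'S2': 0.4}},
     'S2': {'a1': {'S1': 0.2, 'S2': 0.3}, 'a2': {'S1': 0.3, 'S2': 0.2}}}
states = ['S1', 'S2']                           # S1 is the start state

def viterbi(output):
    n = len(output)
    # 1. Initialization: all probability mass on the start state S1
    seqscore = [{s: (1.0 if s == 'S1' else 0.0) for s in states}]
    backptr = [{s: None for s in states}]
    # 2. Iteration: per state, keep only the best-scoring way of reaching it
    for t in range(n):
        col, back = {}, {}
        for si in states:
            best_j, best_p = max(((sj, seqscore[t][sj] * T[sj][output[t]][si])
                                  for sj in states), key=lambda x: x[1])
            col[si], back[si] = best_p, best_j
        seqscore.append(col)
        backptr.append(back)
    # 3. Sequence identification: walk BACKPTR back from the best final state
    last = max(states, key=lambda s: seqscore[n][s])
    path = [last]
    for t in range(n, 0, -1):
        path.append(backptr[t][path[-1]])
    return list(reversed(path)), seqscore[n][last]

print(viterbi(['a1', 'a2', 'a1', 'a2']))
# -> (['S1', 'S2', 'S1', 'S2', 'S1'], ~0.0081), the path recovered on the earlier slide

Keeping only the winner per state at each step is what avoids enumerating every state sequence.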

Viterbi Algorithm for the Urn problem (first two symbols)

[Figure: Viterbi tree for the first two symbols ε and R.
 On ε: S0 -> U1 (0.5), S0 -> U2 (0.3), S0 -> U3 (0.2).
 On R: each Ui again expands to U1, U2, U3; the arc weights shown are
 0.03, 0.08, 0.15 out of U1, 0.06, 0.02, 0.02 out of U2, and 0.18, 0.24, 0.18 out of U3,
 giving leaf scores 0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036.]

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of order > 1 (say 2)

Same theory works:
P(S) P(O|S) = [P(O0|S0) P(S1|S0)]
              [P(O1|S1) P(S2|S1 S0)] [P(O2|S2) P(S3|S2 S1)] [P(O3|S3) P(S4|S3 S2)]
              [P(O4|S4) P(S5|S4 S3)] [P(O5|S5) P(S6|S5 S4)] [P(O6|S6) P(S7|S6 S5)]
              [P(O7|S7) P(S8|S7 S6)] [P(O8|S8) P(S9|S8 S7)]

We introduce the states S0 and S9 as initial and final states respectively.

After S8 the next state is S9 with probability 1, i.e. P(S9|S8 S7) = 1

O0 is the ε-transition

         O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:     ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence? The language modeling problem.

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   - Less probable because of grammatical mistake
3. P("The svn rises in the east")
   - Less probable because of lexical mistake
4. P("The sun rises in the west")
   - Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model:
1. Detect well-formedness
   - Lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   - Given a piece of text, what language does it belong to?
     Good morning  - English
     Guten morgen  - German
     Bon jour      - French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0 o1 o2 o3 ... om)?

P(O) = Σ_S P(O, S)        [Marginalization]

Consider the observation sequence and an underlying state sequence

    O0 O1 O2 ... Om
    S0 S1 S2 S3 ... Sm Sm+1

where the Si's represent the state sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0 o1 o2 o3 ... om)

P(O) = Σ_S P(O, S) = Σ_S P(S) P(O|S)

P(S) P(O|S)
  = P(S0 S1 S2 ... Sm Sm+1) P(O0 O1 O2 ... Om | S0 S1 S2 ... Sm Sm+1)
  = [P(S0) P(S1|S0) P(S2|S1) ... P(Sm+1|Sm)] [P(O0|S0) P(O1|S1) ... P(Om|Sm)]
  = P(S0) [P(O0|S0) P(S1|S0)] [P(O1|S1) P(S2|S1)] ... [P(Om|Sm) P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116
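A brute-force sketch of this marginalization (inefficient by design: it enumerates every state sequence) computes P(O) for the urn HMM; the uniform initial distribution pi is again an assumption for illustration.

# P(O) by brute-force marginalization over every state sequence S.
from itertools import product

A = {'U1': {'U1': 0.1, 'U2': 0.4, 'U3': 0.5},
     'U2': {'U1': 0.6, 'U2': 0.2, 'U3': 0.2},
     'U3': {'U1': 0.3, 'U2': 0.4, 'U3': 0.3}}
B = {'U1': {'R': 0.3, 'G': 0.5, 'B': 0.2},
     'U2': {'R': 0.1, 'G': 0.4, 'B': 0.5},
     'U3': {'R': 0.6, 'G': 0.1, 'B': 0.3}}
pi = {u: 1/3 for u in A}                       # assumed uniform start distribution

def p_obs_bruteforce(obs):
    total = 0.0
    for seq in product(A, repeat=len(obs)):    # every state sequence (3^|O| of them)
        p = pi[seq[0]] * B[seq[0]][obs[0]]
        for k in range(1, len(obs)):
            p *= A[seq[k-1]][seq[k]] * B[seq[k]][obs[k]]
        total += p                             # P(O) = sum over S of P(O, S)
    return total

print(p_obs_bruteforce(list("RRGGBRGR")))

The forward and backward probabilities introduced next compute the same quantity without the exponential enumeration.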

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k,i)

Define F(k,i) = Probability of being in state Si having seen o0 o1 o2 ... ok

F(k,i) = P(o0 o1 o2 ... ok, Si)

With m as the length of the observed sequence and N states,

P(observed sequence) = P(o0 o1 o2 ... om)
                     = Σ_{p=0..N} P(o0 o1 o2 ... om, Sp)
                     = Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd.)

F(k,q) = P(o0 o1 o2 ... ok, Sq)
       = P(o0 o1 o2 ... ok-1, ok, Sq)
       = Σ_{p=0..N} P(o0 o1 o2 ... ok-1, Sp, ok, Sq)
       = Σ_{p=0..N} P(o0 o1 o2 ... ok-1, Sp) P(ok, Sq | o0 o1 o2 ... ok-1, Sp)
       = Σ_{p=0..N} F(k-1, p) P(ok, Sq | Sp)
       = Σ_{p=0..N} F(k-1, p) P(Sp -> Sq)_ok

         O0  O1  O2  O3  ...  Ok  Ok+1  ...  Om-1  Om
         S0  S1  S2  S3  ...  Sp  Sq    ...  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119
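One direct way to realize this recurrence in Python is sketched below, attaching the emission P(ok|Sq) to the destination state so that F(k,q) = P(o0 ... ok, Sq) as defined above (over a complete path this is the same product as the arc notation P(Sp -> Sq)_ok, only grouped differently). The urn tables A and B are reused; the uniform pi is an assumption.

A = {'U1': {'U1': 0.1, 'U2': 0.4, 'U3': 0.5},
     'U2': {'U1': 0.6, 'U2': 0.2, 'U3': 0.2},
     'U3': {'U1': 0.3, 'U2': 0.4, 'U3': 0.3}}
B = {'U1': {'R': 0.3, 'G': 0.5, 'B': 0.2},
     'U2': {'R': 0.1, 'G': 0.4, 'B': 0.5},
     'U3': {'R': 0.6, 'G': 0.1, 'B': 0.3}}
pi = {u: 1/3 for u in A}                       # assumed uniform start distribution

def forward(obs):
    states = list(A)
    F = [{q: pi[q] * B[q][obs[0]] for q in states}]          # F(0,q) = pi(q) P(o0|q)
    for k in range(1, len(obs)):
        F.append({q: B[q][obs[k]] * sum(F[-1][p] * A[p][q] for p in states)
                  for q in states})                          # F(k,q) = P(ok|q) sum_p F(k-1,p) P(q|p)
    return F

F = forward(list("RRGGBRGR"))
print(sum(F[-1].values()))    # P(O) = sum_p F(m,p); matches the brute-force sum above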

Backward probability B(k,i)

Define B(k,i) = Probability of seeing ok ok+1 ok+2 ... om given that the state was Si

B(k,i) = P(ok ok+1 ok+2 ... om | Si)

With m as the length of the whole observed sequence,

P(observed sequence) = P(o0 o1 o2 ... om)
                     = P(o0 o1 o2 ... om | S0)
                     = B(0,0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd.)

B(k,p) = P(ok ok+1 ok+2 ... om | Sp)
       = P(ok+1 ok+2 ... om, ok | Sp)
       = Σ_{q=0..N} P(ok+1 ok+2 ... om, ok, Sq | Sp)
       = Σ_{q=0..N} P(ok, Sq | Sp) P(ok+1 ok+2 ... om | ok, Sq, Sp)
       = Σ_{q=0..N} P(ok+1 ok+2 ... om | Sq) P(ok, Sq | Sp)
       = Σ_{q=0..N} B(k+1, q) P(Sp -> Sq)_ok

         O0  O1  O2  O3  ...  Ok  Ok+1  ...  Om-1  Om
         S0  S1  S2  S3  ...  Sp  Sq    ...  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121
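A matching sketch of the backward recurrence follows. Since the urn example has no explicit fixed start state S0, P(O) is recovered below as Σ_p pi(p) B(0,p) rather than as B(0,0); the uniform pi is the same assumption as before.

A = {'U1': {'U1': 0.1, 'U2': 0.4, 'U3': 0.5},
     'U2': {'U1': 0.6, 'U2': 0.2, 'U3': 0.2},
     'U3': {'U1': 0.3, 'U2': 0.4, 'U3': 0.3}}
B = {'U1': {'R': 0.3, 'G': 0.5, 'B': 0.2},
     'U2': {'R': 0.1, 'G': 0.4, 'B': 0.5},
     'U3': {'R': 0.6, 'G': 0.1, 'B': 0.3}}
pi = {u: 1/3 for u in A}                       # assumed uniform start distribution

def backward(obs):
    states = list(A)
    m = len(obs) - 1
    beta = {p: B[p][obs[m]] for p in states}             # B(m,p) = P(om|Sp)
    for k in range(m - 1, -1, -1):
        beta = {p: B[p][obs[k]] * sum(A[p][q] * beta[q] for q in states)
                for p in states}                         # B(k,p) = sum_q B(k+1,q) P(Sp -> Sq)_ok
    return beta                                          # the values B(0,p)

b0 = backward(list("RRGGBRGR"))
print(sum(pi[p] * b0[p] for p in b0))   # same P(O) as the forward sketch above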

How Forward Probability Works

Goal of Forward Probability: To find P(O) [the probability of the Observation Sequence]

E.g.: ^ People laugh

[Figure: trellis with start state ^, then {N, V} for "People", {N, V} for "laugh",
and the end state .]

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probabilities:
        ^     N     V     .
^       0    0.7   0.3    0
N       0    0.2   0.6   0.2
V       0    0.6   0.2   0.2
.       1     0     0     0

Lexical probabilities:
        ε    People   Laugh
^       1      0        0
N       0     0.8      0.2
V       0     0.1      0.9
.       1      0        0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all state sequences S of Π_i [ P(Oi|Si) P(Si -> Si+1) ]

(an exponential number of paths)

Computation in various paths of the Tree

             ε        People     Laugh
Path 1:  ^   ->   N     ->    N    ->   .
P(Path1) = (1.0 x 0.7) x (0.8 x 0.2) x (0.2 x 0.2)

Path 2:  ^   ->   N     ->    V    ->   .
P(Path2) = (1.0 x 0.7) x (0.8 x 0.6) x (0.9 x 0.2)

Path 3:  ^   ->   V     ->    N    ->   .
P(Path3) = (1.0 x 0.3) x (0.1 x 0.6) x (0.2 x 0.2)

Path 4:  ^   ->   V     ->    V    ->   .
P(Path4) = (1.0 x 0.3) x (0.1 x 0.2) x (0.9 x 0.2)

[Figure: the corresponding tree rooted at ^, branching to N and V on reading
ε/People and again on reading Laugh]

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis

F = accumulated F x transition probability x output probability

F_N1 = 0.7 x 1.0
F_V1 = 0.3 x 1.0
F_N2 = F_N1 x (0.2 x 0.8) + F_V1 x (0.6 x 0.1)
F_V2 = F_N1 x (0.6 x 0.8) + F_V1 x (0.2 x 0.1)
F_.  = F_N2 x (0.2 x 0.2) + F_V2 x (0.2 x 0.9)

[Figure: trellis ^ -> {N, V} -> {N, V} -> . reading ε, People, Laugh]

21 July 2014Pushpak Bhattacharyya Intro

POS 125
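A quick numeric check of these trellis values, with the transition and lexical tables of the "Transition and Lexical Probability Tables" slide written as dictionaries ('eps' is just a placeholder name for the ε column, and F_end stands for F_.); the final value should equal the sum of the four path probabilities of the previous slide.

trans = {'^': {'N': 0.7, 'V': 0.3},
         'N': {'N': 0.2, 'V': 0.6, '.': 0.2},
         'V': {'N': 0.6, 'V': 0.2, '.': 0.2}}
lex = {'^': {'eps': 1.0},
       'N': {'People': 0.8, 'Laugh': 0.2},
       'V': {'People': 0.1, 'Laugh': 0.9}}

F_N1 = 1.0 * lex['^']['eps'] * trans['^']['N']                       # 0.7
F_V1 = 1.0 * lex['^']['eps'] * trans['^']['V']                       # 0.3
F_N2 = F_N1 * trans['N']['N'] * lex['N']['People'] + F_V1 * trans['V']['N'] * lex['V']['People']
F_V2 = F_N1 * trans['N']['V'] * lex['N']['People'] + F_V1 * trans['V']['V'] * lex['V']['People']
F_end = F_N2 * trans['N']['.'] * lex['N']['Laugh'] + F_V2 * trans['V']['.'] * lex['V']['Laugh']

paths = ((1.0*0.7)*(0.8*0.2)*(0.2*0.2) + (1.0*0.7)*(0.8*0.6)*(0.9*0.2) +
         (1.0*0.3)*(0.1*0.6)*(0.2*0.2) + (1.0*0.3)*(0.1*0.2)*(0.9*0.2))
print(F_end, paths)    # both ~0.06676; the trellis reuses the partial products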

Number of Multiplications

Tree: Each path has 5 multiplications. There are 4 paths in the tree, and 3 additions
combine them. Therefore a total of 20 multiplications and 3 additions.

Trellis: F_N1 -> 1 multiplication; F_V1 -> 1 multiplication;
F_N2 = F_N1 x (1 mult) + F_V1 x (1 mult) = 4 multiplications + 1 addition.
Similarly for F_V2 and F_.: 4 multiplications and 1 addition each.
So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = #States and |O| = observation length (excluding the '^' and '.' markers).

Stage 1 of the Trellis: |S| multiplications.
Stage 2 of the Trellis: |S| nodes; each node needs computation over |S| arcs.
    Each arc = 1 multiplication; accumulated F = 1 more multiplication.
    Total: 2|S|^2 multiplications.
Same for each stage before reading '.'.
At the final stage ('.'): 2|S| multiplications.

Therefore, total multiplications = |S| + 2|S|^2 (|O| - 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of F values multiplied by P(Sp -> Sq).
3. Complexity = |S| + 2|S|^2 (|O| - 1) + 2|S|
              = 2|S|^2 |O| - 2|S|^2 + 3|S|
              = O(|S|^2 |O|)

i.e. linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward Probability
   a) Derive the Backward Algorithm
   b) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List

TnT (http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf)

Brill Tagger (http://delivery.acm.org/10.1145/1080000/1075553/p112-brill.pdf?ip=182191671&acc=OPEN&CFID=129797466&CFTOKEN=72601926&__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (http://www.cse.iitb.ac.in/~pb/papers/ACL-2006-Hindi-POS-Tagging.pdf)

Projection (http://www.dipanjandas.com/files/posInduction.pdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131



A famous sentence (12)

ldquoBuffalo buffaloes Buffalo buffaloes buffalo buffalo Buffalo buffaloes Buffalo buffaloes buffalo

21 July 2014Pushpak Bhattacharyya Intro

POS 11

A famous sentence (22)

ldquoBuffalo buffaloes Buffalo buffaloes buffalo buffalo Buffalo buffaloes Buffalo buffaloes buffalo

BuffaloAnimalCitybully

21 July 2014Pushpak Bhattacharyya Intro

POS 12

NLP multilayered multidimensional

Morphology

POS tagging

Chunking

Parsing

Semantics

Discourse and Coreference

IncreasedComplexity OfProcessing

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

Multilinguality Indian situation Major streams

Indo European Dravidian Sino Tibetan Austro-Asiatic

Some languages are ranked within 20 in the world in terms of the populations speaking them Hindi and Urdu 5th (~500

milion) Bangla 7th (~300 million) Marathi 14th (~70 million)

NLP architecture and stages of processing- ambiguity at every stage

Phonetics and phonology Morphology Lexical Analysis Syntactic Analysis Semantic Analysis Pragmatics Discourse

21 July 2014Pushpak Bhattacharyya Intro

POS 15

Phonetics processing of speech sound and associated challenges

Homophones bank (finance) vs bank (river bank) Near Homophones maatraa vs maatra (hin) Word Boundary

आजायग (aajaayenge) (aa jaayenge (will come) or aaj aayenge(will come today)

I got [ua]plate His research is in human languages

Disfluency ah um ahem etc

(near homophone trouble) The king of Abu Dhabi expired and there was national mourning for 7 days Some children were playing in the evening when a person chided them Do not play it is mourning time The children said No it is evening time and we will play

21 July 2014Pushpak Bhattacharyya Intro

POS 16

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 17

Morphology Word formation rules from root words Nouns Plural (boy-boys) Gender marking (czar-czarina) Verbs Tense (stretch-stretched) Aspect (eg perfective sit-had

sat) Modality (eg request khaanaa khaaiie) First crucial first step in NLP Languages rich in morphology eg Dravidian Hungarian

Turkish Languages poor in morphology Chinese English Languages with rich morphology have the advantage of easier

processing at higher stages of processing A task of interest to computer science Finite State Machines for

Word Morphology

21 July 2014Pushpak Bhattacharyya Intro

POS 18

Lexical Analysis

Dictionary and word properties

dognoun (lexical property)take-rsquosrsquo-in-plural (morph property)animate (semantic property)4-legged (-do-)carnivore (-do)

21 July 2014Pushpak Bhattacharyya Intro

POS 19

Lexical Disambiguationpart of Speech Disambiguation

Dog as a noun (animal) Dog as a verb (to pursue)

Sense Disambiguation Dog (as animal) Dog (as a very detestable person) The chair emphasised the need for adult education

Very common in day to day communicationsSatellite Channel Ad Watch what you want when you

want (two senses of watch)Ground breaking ceremonyresearch(ToI 14114) India eradicates polio says WHO

21 July 2014Pushpak Bhattacharyya Intro

POS 20

Technological developments bring in new terms additional meaningsnuances for existing terms

Justify as in justify the right margin (word processing context)

Xeroxed a new verb Digital Trace a new expression Communifaking pretending to talk on

mobile when you are actually not Discomgooglation anxietydiscomfort at

not being able to access internet Helicopter Parenting over parenting Obamagain Obama care modinomics

21 July 2014Pushpak Bhattacharyya Intro

POS 21

Ambiguity of Multiwords The grandfather kicked the bucket after suffering from cancer This job is a piece of cake Put the sweater on He is the dark horse of the match

Google Translations of above sentencesदादा कसर स पी ड़त होन क बाद बा ट लात मार इस काम क कक का एक टकड़ा हवटर पर रखोवह मच क अधर घोड़ा ह

21 July 2014Pushpak Bhattacharyya Intro

POS 22

Ambiguity of Named Entities

Bengali চ ল সরকার বািড়েত আেছEnglish Government is restless at home ()Chanchal Sarkar is at home

Amsterdam airport ldquoBaby Changing Roomrdquo

Hindi द नक दबग द नयाEnglish everyday bold worldActually name of a Hindi newspaper in Indore

High degree of overlap between NEs and MWEs

Treat differently - transliterate do not translate

21 July 2014Pushpak Bhattacharyya Intro

POS 23

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 24

StructureS

NPVP

V NP

Ilike mangoes

21 July 2014Pushpak Bhattacharyya Intro

POS25

Structural Ambiguity Scope

1The old men and women were taken to safe locations(old men and women) vs ((old men) and women)2 No smoking areas will allow Hookas inside

Preposition Phrase Attachment I saw the boy with a telescope

(who has the telescope) I saw the mountain with a telescope

(world knowledge mountain cannot be an instrument of seeing)

Very ubiquitous newspaper headline ldquo20 years later BMC pays father 20 lakhs for causing sonrsquos deathrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 26

Garden pathing

The only minus possibly was the need to face the audience more and more insightful question answer

The old man the boat

The horse raced past the garden fell

21 July 2014Pushpak Bhattacharyya Intro

POS 27

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 28

Semantic Analysis Representation in terms of

Predicate calculusSemantic NetsFramesConceptual Dependencies and Scripts

John gave a book to Mary Give action Agent John Object Book Recipient

Mary Challenge ambiguity in semantic role labeling

(Eng) Visiting aunts can be a nuisance (Hin) aapko mujhe mithaai khilaanii padegii

(ambiguous in Marathi and Bengali too not in Dravidian languages)

21 July 2014Pushpak Bhattacharyya Intro

POS 29

Coreference challenge

Binding of co-referring nouns and pronouns The monkey ate the banana because it

was hungry The monkey ate the banana because it

was ripe and sweet The monkey ate the banana because it

was lunch time

21 July 2014Pushpak Bhattacharyya Intro

POS 30

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 31

Pragmatics Very hard problem Model user intention

Tourist (in a hurry checking out of the hotel motioning to the service boy) Boy go upstairs and see if my sandals are under the divan Do not be late I just have 15 minutes to catch the train

Boy (running upstairs and coming back panting) yes sir they are there

World knowledge WHY INDIA NEEDS A SECOND OCTOBER (ToI

21007)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

DiscourseProcessing of sequence of sentences Mother to John

John go to school It is open today Should you bunk Father will be very angry

Ambiguity of openbunk whatWhy will the father be angry

Complex chain of reasoning and application of world knowledge Ambiguity of father

father as parent or

father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected ndash today was the math test

He couldnrsquot control the class

Teacher shouldnrsquot have made him responsible

After all he is just a janitor

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (12)1 Teacher (angrily) did you miss the class yesterday

Student not much

2 A man coming back to his parked car sees the sticker Parking fine He goes and thanks the policeman for appreciating his parking skill

3 John I got a Jaguar car for my unemployed youngest sonJack Thats a great exchange

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (22)A teacher-student exchangeTeacher What do you think is the capital of Ethiopia

Student What do you think

Teacher (angrily) I do not think ltpausegt I know

Student I do not think I know ltno pausegt21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps Segmentation (Word Boundary Detection) Identify the word

Isolated Word Recognition Identify W given SS (speech signal)

^arg max ( | )

WW P W SS

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood called ldquophonological model ldquo intuitively more tractable

P(W) = prior probability called ldquolanguage modelrdquo

^arg m ax ( | )

arg m ax ( ) ( | )W

W

W P W SS

P W P SS W

W appears in the corpus( ) words in the corpus

P W

21 July 2014Pushpak Bhattacharyya Intro

POS 39

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound Text ambiguity whether vs weather right vs write bought vs bot

Text Sound ambiguity read (present tense) vs read (past tense) lead (verb) vs lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W)= P(t o m ae t o |Word is ldquotomatordquo) = Product of arc probabilities

t

s4

o m o

ae

t

aa

end

s1 s2 s3s5

s6 s7

10 10 10 1010

10

073

027

Word

Pronunciation Automaton

Tomato

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs Discrimnative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

• Entity A and Entity B – Match(A, B)
– Two entities match iff their parts match: Match(Parts(A), Parts(B))
– Two entities match iff their properties match: Match(Properties(A), Properties(B))
• Heart of discriminative vs. generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books, Journals, Proceedings
Main Text(s): Natural Language Understanding, James Allen; Speech and Language Processing, Jurafsky and Martin; Foundations of Statistical NLP, Manning and Schutze
Other References: Statistical NLP, Charniak
Journals: Computational Linguistics, Natural Language Engineering, AI, AI Magazine, IEEE SMC
Conferences:

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied Disciplines
Philosophy: Semantics, meaning of "meaning", Logic (syllogism)
Linguistics: Study of Syntax, Lexicon, Lexical Semantics etc.
Probability and Statistics: Corpus Linguistics, Testing of Hypotheses, System Evaluation
Cognitive Science: Computational Models of Language Processing, Language Acquisition
Psychology: Behavioristic insights into Language Processing, Psychological Models
Brain Science: Language Processing Areas in the Brain
Physics: Information Theory, Entropy, Random Fields
Computer Sc. & Engg.: Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

inflection, declension, derivation etc.), Rule based vs. Statistical NLU comparison with POS tagging as case study, Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summary
• Both Linguistics and Computation are needed
• Linguistics is the eye, Computation the body
• Phenomenon → Formalization → Technique → Experimentation → Evaluation → Hypothesis Testing has accorded to NLP the prestige it commands today
• Natural-Science-like approach
• Neither Theory Building nor Data-Driven Pattern Finding can be ignored

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

``_`` The_DT mechanisms_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, ''_'' said_VBD Mr._NNP Benton_NNP ._.

21 July 2014Pushpak Bhattacharyya Intro

POS 57
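As a quick hands-on illustration (a minimal sketch, assuming NLTK is installed and the 'punkt' and 'averaged_perceptron_tagger' resources have been downloaded), Penn Treebank-style tags like those above can be produced with NLTK's default tagger:

```python
# Requires: pip install nltk; then nltk.download('punkt') and
# nltk.download('averaged_perceptron_tagger') once.
import nltk

sentence = ("The mechanisms that make traditional hardware are really being "
            "obsoleted by microprocessor-based machines, said Mr. Benton.")
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))   # list of (word, Penn Treebank tag) pairs
```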

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increased Complexity Of Processing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate the complexity of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_J batsman_N Aravinda_De_Silva_N is_F a_F man_N of_F few_J words_N was_F very_R much_R evident_J on_F Wednesday_N when_F the_F legendary_J batsman_N ,_F who_F has_V always_R let_V his_N bat_N talk_V ,_F struggled_V to_F answer_V a_F barrage_N of_F questions_N at_F a_F function_N to_F promote_V the_F cricket_N league_N in_F the_F city_N ._F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation: That_FNJ ('that' can be a complementizer (can be put under 'F'), demonstrative (can be put under 'J') or pronoun (can be put under 'N')); former_J; Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper); skipper_NV ('skipper' can be a verb too); and_F; ace_JN ('ace' can be both J and N: "Nadal served an ace"); batsman_NJ ('batsman' can be J as it qualifies Aravinda De Silva); Aravinda_N De_N Silva_N; is_F; a_F; man_NV ('man' can be a verb too, as in 'man the boat'); of_F; few_J; words_NV ('words' can be a verb too, as in 'he words his speeches beautifully')

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation: was_F; very_R; much_R; evident_J; on_F; Wednesday_N; when_FN ('when' can be a relative pronoun (put under 'N'), as in 'I know the time when he comes'); the_F; legendary_J; batsman_N; who_FN; has_V; always_R; let_V; his_N; bat_NV; talk_VN; struggled_VN; answer_VN; barrage_NV; question_NV; function_NV; promote_V; cricket_N; league_N; city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (1/2)
Best tag sequence = T* = argmax P(T|W) = argmax P(T)·P(W|T)   (by Bayes' Theorem)

P(T) = P(t0=^, t1, t2, …, tn+1=.)
     = P(t0)·P(t1|t0)·P(t2|t1,t0)·P(t3|t2,t1,t0) … P(tn|tn-1,tn-2,…,t0)·P(tn+1|tn,tn-1,…,t0)
     = P(t0)·P(t1|t0)·P(t2|t1) … P(tn|tn-1)·P(tn+1|tn)
     = ∏_{i=0}^{n+1} P(ti|ti-1)     (Bigram Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (2/2)

P(W|T) = P(w0|t0…tn+1)·P(w1|w0,t0…tn+1)·P(w2|w1,w0,t0…tn+1) … P(wn|w0…wn-1,t0…tn+1)·P(wn+1|w0…wn,t0…tn+1)

Assumption: A word is determined completely by its tag. This is inspired by speech recognition.

P(W|T) = P(w0|t0)·P(w1|t1) … P(wn+1|tn+1)
       = ∏_{i=0}^{n+1} P(wi|ti)
       = ∏_{i=1}^{n+1} P(wi|ti)     (Lexical Probability Assumption)
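Putting the two assumptions together, a minimal sketch of scoring one candidate tag sequence is shown below; the probability values are illustrative stand-ins, not estimates from a real corpus:

```python
# Score of a tag sequence T for words W under the bigram + lexical assumptions:
#   P(T) * P(W|T) = prod_i P(t_i | t_{i-1}) * P(w_i | t_i)
bigram = {("^", "N"): 0.6, ("N", "V"): 0.4, ("V", "."): 0.2}      # toy values
lexical = {("N", "People"): 1e-3, ("V", "Jump"): 1e-4, (".", "."): 1.0}

def score(words, tags):
    """tags includes the boundary tags ^ ... . ; words align with tags[1:]."""
    p = 1.0
    for i, w in enumerate(words, start=1):
        p *= bigram[(tags[i - 1], tags[i])] * lexical[(tags[i], w)]
    return p

print(score(["People", "Jump", "."], ["^", "N", "V", "."]))
# = (0.6 * 1e-3) * (0.4 * 1e-4) * (0.2 * 1.0)
```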

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R ._.

[Lattice diagram: the words hang off candidate tag states (^, N/V, V/N, A/N, .), with lexical probabilities on the word arcs and bigram probabilities on the tag-to-tag arcs]

This model is called the Generative model. Here words are observed (generated) from tags as states. This is similar to an HMM.

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi – Unigram and Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

[Bar chart: Per POS Accuracy for Bigram Assumption — one bar per BNC tag (AJ0, AJ0-NN1, AJ0-VVG, AJC, AT0, …, VVZ-NN2), accuracy plotted on a 0–1 scale]

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

          AJ0  AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN  AJC  AJS   AT0   AV0  AV0-AJ0  AVP
AJ0      2899     20       32        1        3        3     0    0     18    35      27     1
AJ0-AV0    31     18        2        0        0        0     0    0      0     1      15     0
AJ0-NN1   161      0      116        0        0        0     0    0      0     0       1     0
AJ0-VVD     7      0        0        0        0        0     0    0      0     0       0     0
AJ0-VVG     8      0        0        0        2        0     0    0      1     0       0     0
AJ0-VVN     8      0        0        3        0        2     0    0      1     0       0     0
AJC         2      0        0        0        0        0    69    0      0    11       0     0
AJS         6      0        0        0        0        0     0   38      0     2       0     0
AT0       192      0        0        0        0        0     0    0   7000    13       0     0
AV0       120      8        2        0        0        0    15    2     24  2444      29    11
AV0-AJ0    10      7        0        0        0        0     0    0      0    16      33     0
AVP        24      0        0        0        0        0     0    0      1    11       0   737

21 July 2014Pushpak Bhattacharyya Intro

POS 70
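As a small sketch (toy tags, not the BNC tagset shown above), per-POS accuracy and a confusion matrix can be accumulated from gold vs. predicted tag sequences like this:

```python
# Confusion matrix: row = gold tag, column = predicted tag.
from collections import Counter, defaultdict

gold = ["N", "V", "N", "J", "N", "V"]
pred = ["N", "N", "N", "J", "V", "V"]

confusion = defaultdict(Counter)
for g, p in zip(gold, pred):
    confusion[g][p] += 1

for tag, row in confusion.items():
    per_pos_acc = row[tag] / sum(row.values())   # diagonal / row total
    print(tag, dict(row), "accuracy:", round(per_pos_acc, 2))
```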

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1: # of Red = 30, # of Green = 50, # of Blue = 20
Urn 2: # of Red = 10, # of Green = 40, # of Blue = 50
Urn 3: # of Red = 60, # of Green = 10, # of Blue = 30

(Colored ball choosing)

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

Given the transition probability table:
       U1   U2   U3
U1    0.1  0.4  0.5
U2    0.6  0.2  0.2
U3    0.3  0.4  0.3

and the emission probability table:
       R    G    B
U1    0.3  0.5  0.2
U2    0.1  0.4  0.5
U3    0.6  0.1  0.3

Observation: R R G G B R G R
State sequence: not so easily computable.

21 July 2014Pushpak Bhattacharyya Intro

POS 73

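A minimal sketch of encoding the two tables above as matrices (assuming numpy is available); the state/symbol ordering is U1–U3 and R, G, B:

```python
import numpy as np

states = ["U1", "U2", "U3"]
symbols = ["R", "G", "B"]

A = np.array([[0.1, 0.4, 0.5],     # transition probabilities P(next urn | urn)
              [0.6, 0.2, 0.2],
              [0.3, 0.4, 0.3]])
B = np.array([[0.3, 0.5, 0.2],     # emission probabilities P(colour | urn)
              [0.1, 0.4, 0.5],
              [0.6, 0.1, 0.3]])

assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)
```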

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM:
1. Given the observation sequence, find the possible state sequence(s) – Viterbi
2. Given the observation sequence, find its probability – forward/backward algorithm
3. Given the observation sequence, find the HMM parameters – Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi: The "start" and "end" are important in a sequence. Subtrees get eliminated due to the Markov Assumption.

POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), . (end) [start & end states]

Lexicon: people – N, V; laugh – N, V

Corpora for Training:
^ w11_t11 w12_t12 w13_t13 …… w1k_1_t1k_1
^ w21_t21 w22_t22 w23_t23 …… w2k_2_t2k_2
^ wn1_tn1 wn2_tn2 wn3_tn3 …… wnk_n_tnk_n
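A minimal sketch (toy data, not the slides' corpus) of how the transition and lexical tables below would be estimated by counting over such a tagged corpus:

```python
from collections import Counter

tagged_sentences = [
    [("people", "N"), ("laugh", "V")],
    [("people", "V"), ("laugh", "N")],   # hypothetical second reading
]

trans, emit, tag_count = Counter(), Counter(), Counter()
for sent in tagged_sentences:
    tags = ["^"] + [t for _, t in sent] + ["."]
    for prev, cur in zip(tags, tags[1:]):
        trans[(prev, cur)] += 1           # bigram tag counts
    for w, t in sent:
        emit[(t, w)] += 1                 # word-given-tag counts
    for t in tags:
        tag_count[t] += 1

def p_trans(prev, cur):
    return trans[(prev, cur)] / tag_count[prev]

def p_emit(tag, word):
    return emit[(tag, word)] / tag_count[tag]

print(p_trans("^", "N"), p_emit("N", "people"))   # 0.5 and 0.5 on this toy data
```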

Inference (partial sequence graph): ^ → {N, V} → {N, V} → …

Transition probability table (will change from language to language due to language divergences):

       ^     N     V     O     .
^      0    0.6   0.2   0.2    0
N      0    0.1   0.4   0.3   0.2
V      0    0.3   0.1   0.3   0.3
O      0    0.3   0.2   0.3   0.2
.      1     0     0     0     0

Lexical Probability Table (size = #POS tags in tagset × vocabulary size; vocabulary size = #unique words in the corpus):

       ε       people     laugh      …
^      1         0          0        0
N      0      1×10^-3    1×10^-5
V      0      1×10^-6    1×10^-3
O      0         0       1×10^-9
.      1         0          0        0

Inference on a new sentence: ^ people laugh .

p( ^ N N . | ^ people laugh . ) = (0.6 × 1.0) × (0.1 × 1×10^-3) × (0.2 × 1×10^-5)
[each factor groups P(next tag | current tag) with P(word emitted at the current tag)]
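A tiny numeric check of the product above (a sketch using the toy numbers from the tables; the ε emission at ^ is taken as 1.0):

```python
trans = {("^", "N"): 0.6, ("N", "N"): 0.1, ("N", "."): 0.2}
emit  = {("^", "ε"): 1.0, ("N", "people"): 1e-3, ("N", "laugh"): 1e-5}

score = 1.0
path = [("^", "ε"), ("N", "people"), ("N", "laugh")]
tags = ["^", "N", "N", "."]
for (tag, word), nxt in zip(path, tags[1:]):
    score *= emit[(tag, word)] * trans[(tag, nxt)]
print(score)   # 0.6 * 1e-3 * 0.1 * 1e-5 * 0.2 = 1.2e-10
```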

Computational Complexity

If we had to compute the probability of each sequence and then find the maximum among them, we would run into an exponential number of computations.

If |s| = #states (tags + ^ + .) and |o| = length of the sentence (#words + ^ + .), then #sequences = |s|^(|o|-2).

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming

[Tree: ^ emits ε and branches to N1, V2, O3; each of these emits 'people' and branches to N, V, O, . nodes, which in turn emit 'laugh']

Scores after ε: N1 = 0.6 × 1.0 = 0.6; V2 = 0.2; O3 = 0.2

Expanding N1 on 'people' (accumulated × transition × lexical):
  → N: 0.6 × 0.1 × 10^-3 = 6 × 10^-5
  → V: 0.6 × 0.4 × 10^-3 = 2.4 × 10^-4
  → O: 0.6 × 0.3 × 10^-3 = 1.8 × 10^-4
  → .: 0.6 × 0.2 × 10^-3 = 1.2 × 10^-4

No need to expand N4 and N5 (the N nodes reached from V2 and O3) because they will never be part of the winning sequence.

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.

Now complexity reduces from |s|^|o| to |s|·|o|.

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm

[Tree for the sentence "^ People laugh .": from ^ (emitting ε) the states N, V, O receive scores (0.6), (0.2), (0.2); expanding on 'People', the children scores are (0.06×10^-3), (0.24×10^-3), (0.18×10^-3) under N; (0.06×10^-6), (0.02×10^-6), (0.06×10^-6) under V; and (0), (0), (0) under O]

Claim: We do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass: Will a word always be given the same tag? No. Consider:

^ people the city with soldiers . (i.e. 'populate')
^ quickly people the city .

In the first sentence "people" is most likely to be tagged as noun, whereas in the second the probability mass will shift and "people" will be tagged as verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

[Two N nodes, N1 with accumulated score 6×10^-5 and N2 with 6×10^-8, each expanding to N, V, O on the word LAUGH]

Next step: all the probabilities will be multiplied by identical probabilities (lexical and transition). So the children of N2 will have probability less than the children of N1.

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B) = P(B|A) if P(A) = P(B)

P(A|B) can mean:
Causality: B causes A
Sequentiality: A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = {U1, U2, U3}, V = {R, G, B}
For observation O = o1 … on and state sequence Q = q1 … qn,
π is the initial state distribution: πi = P(q1 = Ui)

A (transition probabilities) =
       U1   U2   U3
U1    0.1  0.4  0.5
U2    0.6  0.2  0.2
U3    0.3  0.4  0.3

B (emission probabilities) =
       R    G    B
U1    0.3  0.5  0.2
U2    0.1  0.4  0.5
U3    0.6  0.1  0.3

92

Observations and states:
        O1 O2 O3 O4 O5 O6 O7 O8
Obs:     R  R  G  G  B  R  G  R
State:  S1 S2 S3 S4 S5 S6 S7 S8

Si ∈ {U1, U2, U3}: a particular state; S: state sequence; O: observation sequence; S*: "best" possible state (urn) sequence.
Goal: maximize P(S|O) by choosing the "best" S.

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O), where S is the State Sequence and O is the Observation Sequence:

S* = argmax_S P(S|O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

P(S|O) = P(S1–8 | O1–8)
       = P(S1|O) · P(S2|S1, O) · P(S3|S1 S2, O) · … · P(S8|S1…S7, O)

By Markov Assumption (a state depends only on the previous state):
       = P(S1|O) · P(S2|S1, O) · P(S3|S2, O) · … · P(S8|S7, O)

Obs:   R  R  G  G  B  R  G  R      (O1 … O8)
State: S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) = P(A) · P(B|A) / P(B)
P(A) – Prior; P(B|A) – Likelihood

argmax_S P(S|O) = argmax_S P(S) · P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

P(S) = P(S1–8)
     = P(S1) · P(S2|S1) · P(S3|S1 S2) · P(S4|S1–3) · … · P(S8|S1–7)

By Markov Assumption (k=1):
     = P(S1) · P(S2|S1) · P(S3|S2) · P(S4|S3) · … · P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

P(O|S) = P(O1|S1–8) · P(O2|O1, S1–8) · P(O3|O1 O2, S1–8) · … · P(O8|O1–7, S1–8)

Assumption: the ball drawn depends only on the Urn chosen:
P(O|S) = P(O1|S1) · P(O2|S2) · P(O3|S3) · … · P(O8|S8)

P(S|O) ∝ P(S) · P(O|S)
       = [P(S1) P(S2|S1) P(S3|S2) P(S4|S3) … P(S8|S7)] · [P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)]

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

[Chain diagram: S0 —ε→ S1 —R→ S2 —R→ S3 —G→ S4 —G→ S5 —B→ S6 —R→ S7 —G→ S8 —R→ S9]

Obs:   ε  R  R  G  G  B  R  G  R      (O0 … O8)
State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

Notation: P(Ok|Sk) · P(Sk+1|Sk) = P(Sk —Ok→ Sk+1)

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

[Two-state probabilistic FSM over S1, S2; arcs carry (symbol, probability) labels such as (a1, 0.3), (a2, 0.4), (a1, 0.2), (a2, 0.3), (a1, 0.1), (a2, 0.2)]

The question here is: "what is the most likely state sequence given the output sequence seen?"

POS 101

Developing the tree

Start: S1 with probability 1.0, S2 with 0.0

On ε:  S1 = 1.0, S2 = 0.0
On a1: S1→S1: 1.0 × 0.1 = 0.1;  S1→S2: 1.0 × 0.3 = 0.3;  S2→S1: 0.0;  S2→S2: 0.0
On a2: 0.1 × 0.2 = 0.02;  0.1 × 0.4 = 0.04;  0.3 × 0.3 = 0.09;  0.3 × 0.2 = 0.06

Choose the winning sequence per state, per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd…

[Continuing the tree from the winning nodes (0.09 at S1, 0.06 at S2): reading a1 gives 0.09 × 0.1 = 0.009, 0.09 × 0.3 = 0.027, 0.06 × 0.2 = 0.012, 0.06 × 0.3 = 0.018; reading a2 then gives the leaf scores 0.0081 (S1, ×0.3), 0.0054 (S2, ×0.2), 0.0048 (S2, ×0.4), 0.0024 (S1, ×0.2)]

The problem being addressed by this tree is S* = argmax_S P(S | a1 a2 a1 a2, µ), where a1-a2-a1-a2 is the output sequence and µ the model (or the machine).

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward): S1 – S2 – S1 – S2 – S1, over the output a1 a2 a1 a2.

Problem statement: Find the best possible sequence S* = argmax_S P(S | O, µ),
where S = state sequence, O = output sequence, µ = machine (or model).

µ = (S, S0, A, T): state collection, start symbol, alphabet set, transitions.
T is defined as P(Si —ak→ Sj) for all i, j, k.

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree (rows: ending state; columns: latest symbol observed):

        ε      a1                               a2             a1                a2
S1     1.0    (1.0×0.1, 0.0×0.2) = (0.1, 0.0)   (0.02, 0.09)   (0.009, 0.012)    (0.0024, 0.0081)
S2     0.0    (1.0×0.3, 0.0×0.3) = (0.3, 0.0)   (0.04, 0.06)   (0.027, 0.018)    (0.0048, 0.0054)

Note: every cell records the winning probability ending in that state. The bold-faced value in each cell shows the sequence probability ending in that state. Going backward from the final winner sequence, which ends in state S2 (indicated by the 2nd tuple), we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm (following James Allen, Natural Language Understanding (2nd edition), Benjamin/Cummings, 1995)

Given:
1. The HMM, which means:
   a. Start State S1
   b. Alphabet A = {a1, a2, …, ap}
   c. Set of States S = {S1, S2, …, Sn}
   d. Transition probability P(Si —ak→ Sj), which is equal to P(Sj | ak, Si)
2. The output string a1 a2 … aT

To find: the most likely sequence of states C1 C2 … CT which produces the given output sequence, i.e.,
C1 C2 … CT = argmax_C [P(C | a1 a2 … aT)]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structures:
1. An N×T array called SEQSCORE to maintain the winner sequence always (N = #states, T = length of output sequence)
2. Another N×T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For i = 2 to N do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For t = 2 to T do
    For i = 1 to N do
        SEQSCORE(i,t) = Max over j = 1..N of [SEQSCORE(j, t-1) × P(Sj —ak→ Si)]
        BACKPTR(i,t) = the index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = the i that maximizes SEQSCORE(i,T)
For i from (T-1) down to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1×T
2. SEQSCORE can be T×2

Homework: compare this with A* and Beam Search. Reason for this comparison: both of them work for finding and recovering a sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 109
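A minimal runnable sketch of the SEQSCORE/BACKPTR procedure above, applied to the urn HMM; assumptions: a uniform start distribution π (the slides instead fix the first state) and the emission attached to the state being entered:

```python
import numpy as np

states = ["U1", "U2", "U3"]
sym = {"R": 0, "G": 1, "B": 2}
pi = np.array([1/3, 1/3, 1/3])                       # assumed start distribution
A = np.array([[0.1, 0.4, 0.5], [0.6, 0.2, 0.2], [0.3, 0.4, 0.3]])
B = np.array([[0.3, 0.5, 0.2], [0.1, 0.4, 0.5], [0.6, 0.1, 0.3]])

def viterbi(obs):
    N, T = len(states), len(obs)
    seqscore = np.zeros((N, T))                      # winning probability per state, per step
    backptr = np.zeros((N, T), dtype=int)            # predecessor giving the max
    seqscore[:, 0] = pi * B[:, sym[obs[0]]]
    for t in range(1, T):
        for i in range(N):
            cand = seqscore[:, t - 1] * A[:, i] * B[i, sym[obs[t]]]
            backptr[i, t] = int(np.argmax(cand))
            seqscore[i, t] = cand[backptr[i, t]]
    # sequence identification: walk BACKPTR from the best final state
    best = [int(np.argmax(seqscore[:, T - 1]))]
    for t in range(T - 1, 0, -1):
        best.append(backptr[best[-1], t])
    return [states[i] for i in reversed(best)]

print(viterbi(list("RRGGBRGR")))
```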

Viterbi Algorithm for the Urn problem (first two symbols)

[Trellis for the urn problem over the first two symbols (ε, then R): S0 branches to U1, U2, U3 with probabilities 0.5, 0.3, 0.2; each urn then branches again on R, and the node and leaf scores shown include 0.03, 0.08, 0.15 and 0.015, 0.04, 0.075, 0.018, 0.006, 0.048, 0.036]

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of order > 1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   • Less probable because of a grammatical mistake
3. P("The svn rises in the east")
   • Less probable because of a lexical mistake
4. P("The sun rises in the west")
   • Less probable because of a semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of a language model:
1. Detect well-formedness
   • Lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   • Given a piece of text, what language does it belong to?
     Good morning – English; Guten Morgen – German; Bonjour – French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0 o1 o2 o3 … om)?

P(O) = Σ_S P(O, S)      (Marginalization)

Consider the observation sequence O0 O1 O2 … Om sitting over a state sequence S0 S1 S2 S3 … Sm Sm+1, where the Si make up the state sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0 o1 o2 o3 … om)

P(O, S) = P(S) · P(O|S)

P(S0 S1 … Sm+1) · P(O0 O1 … Om | S0 S1 … Sm+1)
  = [P(S0) · P(S1|S0) · P(S2|S1) … P(Sm+1|Sm)] · [P(O0|S0) · P(O1|S1) … P(Om|Sm)]
  = [P(O0|S0) · P(S1|S0)] · [P(O1|S1) · P(S2|S1)] … [P(Om|Sm) · P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(k,i) = probability of being in state Si having seen o0 o1 o2 … ok
F(k,i) = P(o0 o1 o2 … ok, Si)

With m as the length of the observed sequence and N states,
P(observed sequence) = P(o0 o1 o2 … om) = Σ_{p=0..N} P(o0 o1 o2 … om, Sp) = Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd.)
F(k, q) = P(o0 o1 o2 … ok, Sq)
        = P(o0 o1 o2 … ok-1, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 … ok-1, Sp, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 … ok-1, Sp) · P(ok, Sq | o0 o1 o2 … ok-1, Sp)
        = Σ_{p=0..N} F(k-1, p) · P(ok, Sq | Sp)
        = Σ_{p=0..N} F(k-1, p) · P(Sp —ok→ Sq)

O0 O1 O2 O3 … Ok Ok+1 … Om-1 Om
S0 S1 S2 S3 … Sp Sq  … Sm  Sfinal

POS 119

Backward probability B(ki)

Define B(k,i) = probability of seeing ok ok+1 ok+2 … om given that the state was Si
B(k,i) = P(ok ok+1 ok+2 … om | Si)

With m as the length of the whole observed sequence,
P(observed sequence) = P(o0 o1 o2 … om) = P(o0 o1 o2 … om | S0) = B(0, 0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd.)
B(k, p) = P(ok ok+1 ok+2 … om | Sp)
        = P(ok, ok+1 ok+2 … om | Sp)
        = Σ_{q=0..N} P(ok, Sq, ok+1 ok+2 … om | Sp)
        = Σ_{q=0..N} P(ok, Sq | Sp) · P(ok+1 ok+2 … om | ok, Sq, Sp)
        = Σ_{q=0..N} P(ok+1 ok+2 … om | Sq) · P(ok, Sq | Sp)
        = Σ_{q=0..N} B(k+1, q) · P(Sp —ok→ Sq)

O0 O1 O2 O3 … Ok Ok+1 … Om-1 Om
S0 S1 S2 S3 … Sp Sq  … Sm  Sfinal
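A matching sketch of the backward recurrence, with the same arc convention P(Sp —ok→ Sq) = P(ok|Sp)·P(Sq|Sp); the matrices are the urn HMM tables from earlier:

```python
import numpy as np

A = np.array([[0.1, 0.4, 0.5], [0.6, 0.2, 0.2], [0.3, 0.4, 0.3]])
E = np.array([[0.3, 0.5, 0.2], [0.1, 0.4, 0.5], [0.6, 0.1, 0.3]])
sym = {"R": 0, "G": 1, "B": 2}

def backward(obs):
    n = A.shape[0]
    Bk = np.ones((len(obs) + 1, n))                  # B(m+1, p) = 1: nothing left to see
    for k in range(len(obs) - 1, -1, -1):
        # B(k, p) = sum_q P(o_k|S_p) P(S_q|S_p) B(k+1, q)
        Bk[k] = E[:, sym[obs[k]]] * (A @ Bk[k + 1])
    return Bk

print(backward(list("RRGGBRGR"))[0])                 # B(0, p) for each starting state
```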

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability: to find P(O) [the probability of the Observation Sequence]

E.g., ^ People laugh .

[Trellis skeleton: ^ → {N, V} → {N, V} → .]

POS 122

Transition and Lexical Probability Tables

Transition:
       ^    N    V    .
^      0   0.7  0.3   0
N      0   0.2  0.6  0.2
V      0   0.6  0.2  0.2
.      1    0    0    0

Lexical:
       ε   People  Laugh
^      1     0       0
N      0    0.8     0.2
V      0    0.1     0.9
.      1     0       0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all state sequences S of  Π_i [P(Oi|Si) · P(Si → Si+1)]

Computation in the various paths of the tree (observations ε, People, Laugh):

Path 1: ^ N N    P(Path1) = (1.0 × 0.7) × (0.8 × 0.2) × (0.2 × 0.2)
Path 2: ^ N V    P(Path2) = (1.0 × 0.7) × (0.8 × 0.6) × (0.9 × 0.2)
Path 3: ^ V N    P(Path3) = (1.0 × 0.3) × (0.1 × 0.6) × (0.2 × 0.2)
Path 4: ^ V V    P(Path4) = (1.0 × 0.3) × (0.1 × 0.2) × (0.9 × 0.2)

[Tree: ^ branches on ε to N and V; each branches on 'People' to N and V; each then emits 'Laugh']

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis
F = accumulated F × output probability × transition probability

F(1,N) = 0.7 × 1.0
F(1,V) = 0.3 × 1.0
F(2,N) = F(1,N) × (0.2 × 0.8) + F(1,V) × (0.6 × 0.1)
F(2,V) = F(1,N) × (0.6 × 0.8) + F(1,V) × (0.2 × 0.1)
F(3,.) = F(2,N) × (0.2 × 0.2) + F(2,V) × (0.2 × 0.9)

[Trellis: ^ → {N, V} on ε → {N, V} on 'People' → . on 'Laugh']

21 July 2014Pushpak Bhattacharyya Intro

POS 125
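A quick numeric check of the trellis values above (a sketch using the toy transition and lexical numbers; each term is accumulated F × transition × emission):

```python
F1N, F1V = 0.7 * 1.0, 0.3 * 1.0
F2N = F1N * (0.2 * 0.8) + F1V * (0.6 * 0.1)   # 0.112 + 0.018 = 0.130
F2V = F1N * (0.6 * 0.8) + F1V * (0.2 * 0.1)   # 0.336 + 0.006 = 0.342
F3  = F2N * (0.2 * 0.2) + F2V * (0.2 * 0.9)   # P(^ People laugh .) ≈ 0.0668
print(F1N, F1V, F2N, F2V, F3)
```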

Number of Multiplications

Tree: each path has 5 multiplications + 1 addition. There are 4 paths in the tree. Therefore a total of 20 multiplications and 3 additions.

Trellis: F(1,N) → 1 multiplication; F(1,V) → 1 multiplication; F(2,N) = F(1,N) × (1 mult) + F(1,V) × (1 mult) = 4 multiplications + 1 addition. Similarly for F(2,V) and F(3,.): 4 multiplications and 1 addition each. So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = #states and |O| = observation length (including ^ and .).
• Stage 1 of the trellis: |S| multiplications
• Stage 2 of the trellis: |S| nodes; each node needs computation over |S| arcs; each arc = 1 multiplication, accumulated F = 1 more multiplication; total 2|S|² multiplications
• Same for each stage before reading '.'; at the final stage ('.') → 2|S| multiplications
• Therefore total multiplications = |S| + 2|S|²(|O| − 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis
2. Take the sum of the F values, multiplied by P(S → .)
3. Complexity = |S| + 2|S|²(|O| − 1) + 2|S|
              = 2|S|²|O| − 2|S|² + 3|S|
              = O(|S|²·|O|)
i.e. linear in the length of the input and quadratic in the number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward Probability: a) derive the Backward Algorithm; b) compute its complexity

2. Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble: auto-completion of words (human vs. machine)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List
TnT (http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf)
Brill Tagger (http://delivery.acm.org/10.1145/1080000/1075553/p112-brill.pdf)
Hindi POS Tagger built by IIT Bombay (http://www.cse.iitb.ac.in/~pb/papers/ACL-2006-Hindi-POS-Tagging.pdf)
Projection (http://www.dipanjandas.com/files/posInduction.pdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131


Tourist (in a hurry checking out of the hotel motioning to the service boy) Boy go upstairs and see if my sandals are under the divan Do not be late I just have 15 minutes to catch the train

Boy (running upstairs and coming back panting) yes sir they are there

World knowledge WHY INDIA NEEDS A SECOND OCTOBER (ToI

21007)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

DiscourseProcessing of sequence of sentences Mother to John

John go to school It is open today Should you bunk Father will be very angry

Ambiguity of openbunk whatWhy will the father be angry

Complex chain of reasoning and application of world knowledge Ambiguity of father

father as parent or

father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected ndash today was the math test

He couldnrsquot control the class

Teacher shouldnrsquot have made him responsible

After all he is just a janitor

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (12)1 Teacher (angrily) did you miss the class yesterday

Student not much

2 A man coming back to his parked car sees the sticker Parking fine He goes and thanks the policeman for appreciating his parking skill

3 John I got a Jaguar car for my unemployed youngest sonJack Thats a great exchange

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (22)A teacher-student exchangeTeacher What do you think is the capital of Ethiopia

Student What do you think

Teacher (angrily) I do not think ltpausegt I know

Student I do not think I know ltno pausegt21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps Segmentation (Word Boundary Detection) Identify the word

Isolated Word Recognition Identify W given SS (speech signal)

^arg max ( | )

WW P W SS

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood called ldquophonological model ldquo intuitively more tractable

P(W) = prior probability called ldquolanguage modelrdquo

^arg m ax ( | )

arg m ax ( ) ( | )W

W

W P W SS

P W P SS W

W appears in the corpus( ) words in the corpus

P W

21 July 2014Pushpak Bhattacharyya Intro

POS 39

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound Text ambiguity whether vs weather right vs write bought vs bot

Text Sound ambiguity read (present tense) vs read (past tense) lead (verb) vs lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W)= P(t o m ae t o |Word is ldquotomatordquo) = Product of arc probabilities

t

s4

o m o

ae

t

aa

end

s1 s2 s3s5

s6 s7

10 10 10 1010

10

073

027

Word

Pronunciation Automaton

Tomato

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs Discrimnative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

bull Entity A and Entity BndashMatch(AB)

ndashTwo entities match iff their parts matchbull Match(Parts(A) Parts(B))

ndashTwo entities match iff their properties matchbull Match(Properties(A) Properties(B))

bull Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding James Allan Speech and NLP Jurafsky and Martin Foundations of Statistical NLP Manning and Schutze

Other References Statistical NLP Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied DisciplinesPhilosophy Semantics Meaning of ldquomeaningrdquo Logic

(syllogism)Linguistics Study of Syntax Lexicon Lexical Semantics etc

Probability and Statistics Corpus Linguistics Testing of Hypotheses System Evaluation

Cognitive Science Computational Models of Language Processing Language Acquisition

Psychology Behavioristic insights into Language Processing Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory Entropy Random Fields

Computer Sc amp Engg Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

infection declension derivation etc ) Rule based Vs Statistical NLU comparison with POS tagging as case study Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

ldquo_ldquo The_DT mechanisms_NNS that_WDTmake_VBP traditional_JJ hardware_NNare_VBP really_RB being_VBGobsoleted_VBN by_IN microprocessor-based_JJ machines_NNS _ rdquo_rdquo said_VBDMr_NNP Benton_NNP _

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = the i that maximizes SEQSCORE(i,T)
For i from (T-1) to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1 x T
2. SEQSCORE can be T x 2

Homework: compare this with A*/Beam Search. Reason for this comparison: both of them work for finding and recovering sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 109
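A minimal Python sketch of these three steps (initialization, iteration, sequence identification). The data layout, function name, and the convention of scoring the emission at the current state are my own choices, not prescribed by the slides:

```python
# Minimal Viterbi sketch with SEQSCORE and BACKPTR, mirroring the three steps above.
def viterbi(states, start, trans, emit, outputs):
    # trans[j][i] = P(i | j); emit[i][o] = P(o | i); outputs[0] is the start symbol (e.g. epsilon)
    T = len(outputs)
    SEQSCORE = [{s: 0.0 for s in states} for _ in range(T)]
    BACKPTR = [{s: None for s in states} for _ in range(T)]
    SEQSCORE[0][start] = 1.0                      # initialization: the first state is the start state
    for t in range(1, T):                         # iteration
        for i in states:
            scores = {j: SEQSCORE[t - 1][j]
                         * trans.get(j, {}).get(i, 0.0)
                         * emit.get(i, {}).get(outputs[t], 0.0)
                      for j in states}
            best_j = max(scores, key=scores.get)
            SEQSCORE[t][i] = scores[best_j]
            BACKPTR[t][i] = best_j
    # sequence identification: pick the best final state, then walk BACKPTR backwards
    C = [max(states, key=lambda s: SEQSCORE[T - 1][s])]
    for t in range(T - 1, 0, -1):
        C.append(BACKPTR[t][C[-1]])
    return list(reversed(C))
```

Run, for example, with the urn transition and emission tables given later in these notes and the output sequence ε R R G G B R G R, it recovers the most likely urn sequence.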

Viterbi Algorithm for the Urn problem (first two symbols)

[Trellis/tree: from S0, ε-arcs reach U1, U2, U3 with probabilities 0.5, 0.3, 0.2; each of these is then expanded on R to U1, U2, U3, the arc values being P(R|Ui) x P(Uj|Ui), and the leaf scores shown (0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036) being the accumulated products. Only the best-scoring node per state survives to the next level.]

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory works:
P(S) P(O|S) = [P(O0|S0) P(S1|S0)] [P(O1|S1) P(S2|S1,S0)] [P(O2|S2) P(S3|S2,S1)] [P(O3|S3) P(S4|S3,S2)] [P(O4|S4) P(S5|S4,S3)] [P(O5|S5) P(S6|S5,S4)] [P(O6|S6) P(S7|S6,S5)] [P(O7|S7) P(S8|S7,S6)] [P(O8|S8) P(S9|S8,S7)]

We introduce the states S0 and S9 as initial and final states, respectively.
After S8 the next state is S9 with probability 1, i.e., P(S9|S8,S7) = 1.
O0 is the ε-transition.

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:    ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0, o1, o2, o3, ..., om)?

P(O) = Σ_S P(O, S)      [Marginalization]

Consider the observation sequence O = o0 o1 o2 ... om, with candidate state sequences S = S0 S1 S2 ... Sm Sm+1,
where the Si's represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0, o1, o2, o3, ..., om)

P(O, S) = P(S) P(O|S)

P(S0 S1 ... Sm Sm+1) P(o0 o1 ... om | S0 S1 ... Sm Sm+1)
= [P(S0) P(S1|S0) P(S2|S1) ... P(Sm+1|Sm)] [P(o0|S0) P(o1|S1) ... P(om|Sm)]
= P(S0) [P(o0|S0) P(S1|S0)] [P(o1|S1) P(S2|S1)] ... [P(om|Sm) P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k,i)

Define F(k,i) = probability of being in state Si having seen o0 o1 o2 ... ok
F(k,i) = P(o0 o1 o2 ... ok, Si)
With m as the length of the observed sequence and N states,
P(observed sequence) = P(o0 o1 o2 ... om)
= Σ_{p=0..N} P(o0 o1 o2 ... om, Sp)
= Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd.)

F(k, q)
= P(o0 o1 o2 ... ok, Sq)
= P(o0 o1 o2 ... ok-1, ok, Sq)
= Σ_{p=0..N} P(o0 o1 o2 ... ok-1, Sp, ok, Sq)
= Σ_{p=0..N} P(o0 o1 o2 ... ok-1, Sp) P(ok, Sq | o0 o1 o2 ... ok-1, Sp)
= Σ_{p=0..N} F(k-1, p) P(ok, Sq | Sp)        [Markov assumption]
= Σ_{p=0..N} F(k-1, p) P(Sp -> Sq)^ok

O0  O1  O2  O3  ...  Ok  Ok+1  ...  Om-1  Om
S0  S1  S2  S3  ...  Sp  Sq    ...  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(k,i)

Define B(k,i) = probability of seeing ok ok+1 ok+2 ... om given that the state was Si
B(k,i) = P(ok ok+1 ok+2 ... om | Si)
With m as the length of the whole observed sequence,
P(observed sequence) = P(o0 o1 o2 ... om)
= P(o0 o1 o2 ... om | S0)
= B(0, 0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd.)

B(k, p)
= P(ok ok+1 ok+2 ... om | Sp)
= Σ_{q=0..N} P(ok, ok+1 ok+2 ... om, Sq | Sp)
= Σ_{q=0..N} P(ok, Sq | Sp) P(ok+1 ok+2 ... om | ok, Sq, Sp)
= Σ_{q=0..N} P(ok+1 ok+2 ... om | Sq) P(ok, Sq | Sp)        [Markov assumption]
= Σ_{q=0..N} B(k+1, q) P(Sp -> Sq)^ok

O0  O1  O2  O3  ...  Ok  Ok+1  ...  Om-1  Om
S0  S1  S2  S3  ...  Sp  Sq    ...  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition:
      ^    N    V    $
^     0   0.7  0.3   0
N     0   0.2  0.6  0.2
V     0   0.6  0.2  0.2
$     1    0    0    0

Lexical (emission):
      ε   People  Laugh
^     1     0       0
N     0    0.8     0.2
V     0    0.1     0.9
$     1     0       0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all state sequences S of Π_j [ P(Oj|Sj) P(Sj+1|Sj) ]   (a sum over exponentially many paths)

Computation in various paths of the tree (observations: ε, People, Laugh)

Path 1: ^ N N     P(Path1) = (1.0 x 0.7) x (0.8 x 0.2) x (0.2 x 0.2)
Path 2: ^ N V     P(Path2) = (1.0 x 0.7) x (0.8 x 0.6) x (0.9 x 0.2)
Path 3: ^ V N     P(Path3) = (1.0 x 0.3) x (0.1 x 0.6) x (0.2 x 0.2)
Path 4: ^ V V     P(Path4) = (1.0 x 0.3) x (0.1 x 0.2) x (0.9 x 0.2)

[Tree: ^ --ε--> {N, V} --People--> {N, V} --Laugh--> $]

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis

F = accumulated F x output probability x transition probability

F1 (N, after ε)      = 0.7 x 1.0
F2 (V, after ε)      = 0.3 x 1.0
F3 (N, after People) = F1 x (0.2 x 0.8) + F2 x (0.6 x 0.1)
F4 (V, after People) = F1 x (0.6 x 0.8) + F2 x (0.2 x 0.1)
F5 ($, after Laugh)  = F3 x (0.2 x 0.2) + F4 x (0.2 x 0.9)

[Trellis: ^ --ε--> {N, V} --People--> {N, V} --Laugh--> $]

21 July 2014Pushpak Bhattacharyya Intro

POS 125
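The same trellis computation as a small Python sketch, using the transition and lexical tables above (variable names and the explicit END state are my own assumptions):

```python
# Forward-probability trellis for "^ people laugh" with the toy tables above.
# trans[s1][s2] = P(s2|s1); emit[s][w] = P(w|s); 'END' plays the role of the end state.
trans = {'^': {'N': 0.7, 'V': 0.3, 'END': 0.0},
         'N': {'N': 0.2, 'V': 0.6, 'END': 0.2},
         'V': {'N': 0.6, 'V': 0.2, 'END': 0.2}}
emit = {'^': {'eps': 1.0},
        'N': {'people': 0.8, 'laugh': 0.2},
        'V': {'people': 0.1, 'laugh': 0.9}}

def forward(words):
    # F maps state -> accumulated forward probability after consuming the prefix
    F = {'^': 1.0}
    prev_obs = 'eps'                      # the start state emits epsilon
    for w in words + [None]:              # None marks the final transition to END
        targets = ['END'] if w is None else ['N', 'V']
        newF = {}
        for t in targets:
            newF[t] = sum(F[s] * emit[s].get(prev_obs, 0.0) * trans[s][t]
                          for s in F)     # accumulated F x output prob x transition prob
        F, prev_obs = newF, w
    return F['END']

print(forward(['people', 'laugh']))       # 0.06676 = F5 = sum of the four path probabilities
```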

Number of Multiplications

Tree: each path has 5 multiplications; combining the 4 paths needs 3 additions; therefore a total of 20 multiplications and 3 additions.

Trellis:
F1 -> 1 multiplication
F2 -> 1 multiplication
F3 = F1 x (1 mult) + F2 x (1 mult) = 4 multiplications + 1 addition
Similarly for F4 and F5: 4 multiplications and 1 addition each.
So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = #states and |O| = observation length (including ^ and the end marker $).
Stage 1 of the trellis: |S| multiplications.
Stage 2 of the trellis: |S| nodes; each node needs computation over |S| arcs. Each arc = 1 multiplication; accumulating into F = 1 more multiplication. Total: 2|S|^2 multiplications.
Same for each stage before reading '$'. At the final stage ('$') -> 2|S| multiplications.
Therefore, total multiplications = |S| + 2|S|^2 (|O| - 1) + 2|S|.

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values multiplied by P(S_i -> S_j).
3. Complexity = |S| + 2|S|^2 (|O| - 1) + 2|S|
              = 2|S|^2 |O| - 2|S|^2 + 3|S|
              = O(|S|^2 |O|),
i.e., linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward Probability:
   a) Derive the Backward Algorithm
   b) Compute its complexity
2. Express P(O) in terms of both Forward and Backward probability.

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

NLP multilayered multidimensional

Morphology

POS tagging

Chunking

Parsing

Semantics

Discourse and Coreference

IncreasedComplexity OfProcessing

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

Multilinguality Indian situation Major streams

Indo European Dravidian Sino Tibetan Austro-Asiatic

Some languages are ranked within the top 20 in the world in terms of the populations speaking them: Hindi and Urdu 5th (~500 million), Bangla 7th (~300 million), Marathi 14th (~70 million)

NLP architecture and stages of processing- ambiguity at every stage

Phonetics and phonology Morphology Lexical Analysis Syntactic Analysis Semantic Analysis Pragmatics Discourse

21 July 2014Pushpak Bhattacharyya Intro

POS 15

Phonetics processing of speech sound and associated challenges

Homophones bank (finance) vs bank (river bank) Near Homophones maatraa vs maatra (hin) Word Boundary

आजायग (aajaayenge) (aa jaayenge (will come) or aaj aayenge(will come today)

I got a plate / I got up late; His research is in human languages / is inhuman languages

Disfluency ah um ahem etc

(near homophone trouble) The king of Abu Dhabi expired and there was national mourning for 7 days Some children were playing in the evening when a person chided them Do not play it is mourning time The children said No it is evening time and we will play

21 July 2014Pushpak Bhattacharyya Intro

POS 16

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 17

Morphology Word formation rules from root words Nouns Plural (boy-boys) Gender marking (czar-czarina) Verbs Tense (stretch-stretched) Aspect (eg perfective sit-had

sat), Modality (e.g. request: khaanaa, khaaiie); the first crucial step in NLP; Languages rich in morphology: e.g. Dravidian, Hungarian,

Turkish Languages poor in morphology Chinese English Languages with rich morphology have the advantage of easier

processing at higher stages of processing A task of interest to computer science Finite State Machines for

Word Morphology

21 July 2014Pushpak Bhattacharyya Intro

POS 18

Lexical Analysis

Dictionary and word properties

dog: noun (lexical property); takes 's' in plural (morph property); animate (semantic property); 4-legged (-do-); carnivore (-do-)

21 July 2014Pushpak Bhattacharyya Intro

POS 19

Lexical Disambiguationpart of Speech Disambiguation

Dog as a noun (animal) Dog as a verb (to pursue)

Sense Disambiguation Dog (as animal) Dog (as a very detestable person) The chair emphasised the need for adult education

Very common in day-to-day communication. Satellite channel ad: "Watch what you want, when you want" (two senses of watch). "Ground breaking" ceremony/research. (ToI 14/1/14) India eradicates polio, says WHO.

21 July 2014Pushpak Bhattacharyya Intro

POS 20

Technological developments bring in new terms additional meaningsnuances for existing terms

Justify as in justify the right margin (word processing context)

Xeroxed a new verb Digital Trace a new expression Communifaking pretending to talk on

mobile when you are actually not Discomgooglation anxietydiscomfort at

not being able to access internet Helicopter Parenting over parenting Obamagain Obama care modinomics

21 July 2014Pushpak Bhattacharyya Intro

POS 21

Ambiguity of Multiwords The grandfather kicked the bucket after suffering from cancer This job is a piece of cake Put the sweater on He is the dark horse of the match

Google Translations of above sentencesदादा कसर स पी ड़त होन क बाद बा ट लात मार इस काम क कक का एक टकड़ा हवटर पर रखोवह मच क अधर घोड़ा ह

21 July 2014Pushpak Bhattacharyya Intro

POS 22

Ambiguity of Named Entities

Bengali চ ল সরকার বািড়েত আেছEnglish Government is restless at home ()Chanchal Sarkar is at home

Amsterdam airport ldquoBaby Changing Roomrdquo

Hindi द नक दबग द नयाEnglish everyday bold worldActually name of a Hindi newspaper in Indore

High degree of overlap between NEs and MWEs

Treat differently - transliterate do not translate

21 July 2014Pushpak Bhattacharyya Intro

POS 23

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 24

StructureS

NPVP

V NP

Ilike mangoes

21 July 2014Pushpak Bhattacharyya Intro

POS25

Structural Ambiguity Scope

1The old men and women were taken to safe locations(old men and women) vs ((old men) and women)2 No smoking areas will allow Hookas inside

Preposition Phrase Attachment I saw the boy with a telescope

(who has the telescope) I saw the mountain with a telescope

(world knowledge mountain cannot be an instrument of seeing)

Very ubiquitous; newspaper headline: "20 years later, BMC pays father 20 lakhs for causing son's death"

21 July 2014Pushpak Bhattacharyya Intro

POS 26

Garden pathing

The only minus possibly was the need to face the audience more and more insightful question answer

The old man the boat

The horse raced past the garden fell

21 July 2014Pushpak Bhattacharyya Intro

POS 27

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 28

Semantic Analysis Representation in terms of

Predicate calculusSemantic NetsFramesConceptual Dependencies and Scripts

John gave a book to Mary Give action Agent John Object Book Recipient

Mary Challenge ambiguity in semantic role labeling

(Eng) Visiting aunts can be a nuisance (Hin) aapko mujhe mithaai khilaanii padegii

(ambiguous in Marathi and Bengali too not in Dravidian languages)

21 July 2014Pushpak Bhattacharyya Intro

POS 29

Coreference challenge

Binding of co-referring nouns and pronouns The monkey ate the banana because it

was hungry The monkey ate the banana because it

was ripe and sweet The monkey ate the banana because it

was lunch time

21 July 2014Pushpak Bhattacharyya Intro

POS 30

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 31

Pragmatics Very hard problem Model user intention

Tourist (in a hurry checking out of the hotel motioning to the service boy) Boy go upstairs and see if my sandals are under the divan Do not be late I just have 15 minutes to catch the train

Boy (running upstairs and coming back panting) yes sir they are there

World knowledge WHY INDIA NEEDS A SECOND OCTOBER (ToI

21007)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

DiscourseProcessing of sequence of sentences Mother to John

John go to school It is open today Should you bunk Father will be very angry

Ambiguity of openbunk whatWhy will the father be angry

Complex chain of reasoning and application of world knowledge Ambiguity of father

father as parent or

father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected - today was the math test.

He couldn't control the class.

Teacher shouldn't have made him responsible.

After all, he is just a janitor.

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (1/2)
1. Teacher (angrily): Did you miss the class yesterday?

Student: Not much.

2. A man coming back to his parked car sees the sticker "Parking fine". He goes and thanks the policeman for appreciating his parking skill.

3. John: I got a Jaguar car for my unemployed youngest son. Jack: That's a great exchange!

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (2/2): a teacher-student exchange
Teacher: What do you think is the capital of Ethiopia?

Student: What do you think?

Teacher (angrily): I do not think <pause> I know.

Student: I do not think I know. <no pause>
21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps: Segmentation (Word Boundary Detection); Identify the word

Isolated Word Recognition: identify W given SS (speech signal)

W* = argmax_W P(W | SS)

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood, called the "phonological model"; intuitively more tractable

P(W) = prior probability, called the "language model"

W* = argmax_W P(W | SS)
   = argmax_W P(W) P(SS | W)

P(W) = #(W appears in the corpus) / #(words in the corpus)

21 July 2014Pushpak Bhattacharyya Intro

POS 39
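The language-model prior above is just a relative-frequency estimate; a toy sketch (the corpus and tokenisation are illustrative assumptions):

```python
from collections import Counter

# Relative-frequency estimate of P(W), as on the slide:
# count of W divided by the total number of words in the corpus.
corpus = "the sun rises in the east . the sun sets in the west .".split()
counts = Counter(corpus)
total = len(corpus)

def prior(word):
    return counts[word] / total

print(prior("the"), prior("sun"))   # 4/14 and 2/14 for this toy corpus
```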

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound Text ambiguity whether vs weather right vs write bought vs bot

Text Sound ambiguity read (present tense) vs read (past tense) lead (verb) vs lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t: sound of 't' in tag; d: sound of 'd' in dog; D: sound of 'th' in the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb), Advice (noun): ad-vise / ad-vice

• A syllable consists of:
1. Nucleus
2. Onset
3. Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W) = P(t o m ae t o | word is "tomato") = product of arc probabilities

[Pronunciation automaton for "Tomato": states s1 ... s7; arcs t -> o -> m -> (ae with probability 0.73, or aa with 0.27) -> t -> o -> end; all other arc probabilities are 1.0]

21 July 2014Pushpak Bhattacharyya Intro

POS 44
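A toy sketch of the product-of-arc-probabilities computation for the automaton above (the dictionary encoding and helper function are my own illustration):

```python
# Product-of-arc-probabilities view of P(SS | W) for the "tomato" automaton.
# Only the ae/aa branch is non-deterministic; every other arc has probability 1.0.
arc_prob = {('m', 'ae'): 0.73, ('m', 'aa'): 0.27}   # all other arcs default to 1.0

def pronunciation_prob(phonemes):
    p = 1.0
    for prev, cur in zip(phonemes, phonemes[1:]):
        p *= arc_prob.get((prev, cur), 1.0)
    return p

print(pronunciation_prob(['t', 'o', 'm', 'ae', 't', 'o']))  # 0.73
print(pronunciation_prob(['t', 'o', 'm', 'aa', 't', 'o']))  # 0.27
```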

Foundational question

Generative vs Discrimnative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

• Entity A and Entity B
  - Match(A, B)
  - Two entities match iff their parts match: Match(Parts(A), Parts(B))
  - Two entities match iff their properties match: Match(Properties(A), Properties(B))
• Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding, James Allen; Speech and Language Processing, Jurafsky and Martin; Foundations of Statistical NLP, Manning and Schutze

Other References: Statistical NLP, Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied Disciplines

Philosophy: semantics, meaning of "meaning", logic (syllogism)
Linguistics: study of syntax, lexicon, lexical semantics, etc.
Probability and Statistics: corpus linguistics, testing of hypotheses, system evaluation
Cognitive Science: computational models of language processing, language acquisition
Psychology: behavioristic insights into language processing, psychological models
Brain Science: language processing areas in the brain
Physics: information theory, entropy, random fields
Computer Sc. & Engg.: systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

inflection, declension, derivation, etc.), Rule-based vs Statistical NLU comparison with POS tagging as case study, Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summary
• Both Linguistics and Computation needed
• Linguistics is the eye, Computation the body
• Phenomenon -> Formalization -> Technique -> Experimentation -> Evaluation -> Hypothesis Testing
• ...has accorded to NLP the prestige it commands today
• Natural-Science-like approach
• Neither Theory Building nor Data-Driven Pattern finding can be ignored
21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

"_" The_DT mechanisms_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, "_" said_VBD Mr._NNP Benton_NNP ._.

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation: That_F/N/J ('that' can be a complementizer (can be put under 'F'), a demonstrative (can be put under 'J'), or a pronoun (can be put under 'N')); former_J; Sri_N/J Lanka_N/J (Sri Lanka together qualify the skipper); skipper_N/V ('skipper' can be a verb too); and_F; ace_J/N ('ace' can be both J and N: "Nadal served an ace"); batsman_N/J ('batsman' can be J as it qualifies Aravinda De Silva); Aravinda_N De_N Silva_N; is_F; a_F; man_N/V ('man' can be a verb too, as in 'man the boat'); of_F; few_J; words_N/V ('words' can be a verb too, as in 'he words his speeches beautifully')

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of "That"

That man is known by the company he keeps. (Demonstrative)

Man that is known by the company he keeps gets a good job. (Pronoun)

That man is known by the company he keeps is a proverb. (Complementizer)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation: was_F; very_R; much_R; evident_J; on_F; Wednesday_N; when_F/N ('when' can be a relative pronoun (put under 'N'), as in 'I know the time when he comes'); the_F; legendary_J; batsman_N; who_F/N; has_V; always_R; let_V; his_N; bat_N/V; talk_V/N; struggle_V/N; answer_V/N; barrage_N/V; question_N/V; function_N/V; promote_V; cricket_N; league_N; city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (1/2)

Best tag sequence
= T* = argmax P(T|W)
= argmax P(T) P(W|T)       (by Bayes' Theorem)

P(T) = P(t0=^, t1, t2, ..., tn+1)
     = P(t0) P(t1|t0) P(t2|t1,t0) P(t3|t2,t1,t0) ... P(tn|tn-1,...,t0) P(tn+1|tn,...,t0)
     = P(t0) P(t1|t0) P(t2|t1) ... P(tn|tn-1) P(tn+1|tn)
     = Π_{i=0..n+1} P(ti|ti-1)      (Bigram Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (2/2)

P(W|T) = P(w0|t0..tn+1) P(w1|w0,t0..tn+1) P(w2|w1,w0,t0..tn+1) ... P(wn|w0..wn-1,t0..tn+1) P(wn+1|w0..wn,t0..tn+1)

Assumption: a word is determined completely by its tag. This is inspired by speech recognition.

= P(w0|t0) P(w1|t1) ... P(wn+1|tn+1)
= Π_{i=0..n+1} P(wi|ti)
= Π_{i=1..n+1} P(wi|ti)      (Lexical Probability Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R ._.

[Lattice: each word can take several tags (e.g. People: N/V, Jump: V/N, High: A/N); arcs between adjacent tags carry the bigram probabilities, and arcs from each tag to its word carry the lexical probabilities]

This model is called the Generative model. Here words are observed from tags as states. This is similar to HMM.

21 July 2014Pushpak Bhattacharyya Intro

POS 67
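To make the generative scoring concrete, here is a minimal Python sketch of computing P(T) P(W|T) for one candidate tag sequence under the bigram and lexical-probability assumptions; the probability values below are placeholders for illustration, not numbers from the slides:

```python
# Score P(T) * P(W|T) for one candidate tag sequence.
def score(words, tags, bigram, lexical):
    # bigram[(t_prev, t)] = P(t | t_prev); lexical[(w, t)] = P(w | t)
    p = 1.0
    prev = '^'
    for w, t in zip(words, tags):
        p *= bigram.get((prev, t), 0.0) * lexical.get((w, t), 0.0)
        prev = t
    return p

# Hypothetical numbers, for illustration only
bigram = {('^', 'N'): 0.6, ('N', 'V'): 0.4, ('V', 'R'): 0.3}
lexical = {('people', 'N'): 1e-3, ('jump', 'V'): 1e-3, ('high', 'R'): 1e-2}
print(score(['people', 'jump', 'high'], ['N', 'V', 'R'], bigram, lexical))
```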

Typical POS tagging steps

Implementation of Viterbi - Unigram, Bigram
Five-fold Evaluation
Per-POS Accuracy
Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

[Bar chart - Per POS Accuracy for Bigram Assumption: one bar (accuracy between 0 and 1) per BNC tag, from AJ0, AJ0-NN1, AJ0-VVG, AJC, AT0, AV0-AJ0, AVP-PRP, AVQ-CJS, CJS, ... through VVN, VVN-VVD, VVZ-NN2]

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

          AJ0  AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN  AJC  AJS   AT0   AV0   AV0-AJ0  AVP
AJ0      2899    20       32        1        3        3      0    0     18    35      27     1
AJ0-AV0    31    18        2        0        0        0      0    0      0     1      15     0
AJ0-NN1   161     0      116        0        0        0      0    0      0     0       1     0
AJ0-VVD     7     0        0        0        0        0      0    0      0     0       0     0
AJ0-VVG     8     0        0        0        2        0      0    0      1     0       0     0
AJ0-VVN     8     0        0        3        0        2      0    0      1     0       0     0
AJC         2     0        0        0        0        0     69    0      0    11       0     0
AJS         6     0        0        0        0        0      0   38      0     2       0     0
AT0       192     0        0        0        0        0      0    0   7000    13       0     0
AV0       120     8        2        0        0        0     15    2     24  2444      29    11
AV0-AJ0    10     7        0        0        0        0      0    0      0    16      33     0
AVP        24     0        0        0        0        0      0    0      1    11       0   737

21 July 2014Pushpak Bhattacharyya Intro

POS 70
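For the evaluation steps above (per-POS accuracy, confusion matrix), a short sketch of how such a matrix can be accumulated from gold vs. predicted tags; the tag names and tiny example data are illustrative only:

```python
from collections import defaultdict

# confusion[gold][predicted] counts how often a token with gold tag `gold`
# was tagged `predicted` by the tagger.
def confusion_matrix(gold_tags, predicted_tags):
    confusion = defaultdict(lambda: defaultdict(int))
    for g, p in zip(gold_tags, predicted_tags):
        confusion[g][p] += 1
    return confusion

def per_pos_accuracy(confusion):
    # fraction of tokens of each gold tag that were tagged correctly
    return {g: row[g] / sum(row.values()) for g, row in confusion.items()}

cm = confusion_matrix(['AJ0', 'AT0', 'AJ0'], ['AJ0', 'AT0', 'AV0'])
print(per_pos_accuracy(cm))   # {'AJ0': 0.5, 'AT0': 1.0}
```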

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example: colored ball choosing

Urn 1: Red 30%, Green 50%, Blue 20%
Urn 2: Red 10%, Green 40%, Blue 50%
Urn 3: Red 60%, Green 10%, Blue 30%

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd.)

Transition probability table A:
      U1   U2   U3
U1   0.1  0.4  0.5
U2   0.6  0.2  0.2
U3   0.3  0.4  0.3

Emission probability table B:
       R    G    B
U1   0.3  0.5  0.2
U2   0.1  0.4  0.5
U3   0.6  0.1  0.3

Given observation: R R G G B R G R
State sequence: ??? - not so easily computable!

21 July 2014Pushpak Bhattacharyya Intro

POS 73

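Written out as code, the urn HMM is just these tables (a sketch; the initial distribution π = 0.5/0.3/0.2 is the one shown on the Viterbi-for-the-urn-problem slide earlier):

```python
# The urn HMM written as plain Python dictionaries.
states = ['U1', 'U2', 'U3']
observations = ['R', 'G', 'B']

# A[i][j] = P(next urn = j | current urn = i)   (transition probability table)
A = {'U1': {'U1': 0.1, 'U2': 0.4, 'U3': 0.5},
     'U2': {'U1': 0.6, 'U2': 0.2, 'U3': 0.2},
     'U3': {'U1': 0.3, 'U2': 0.4, 'U3': 0.3}}

# B[i][o] = P(ball colour = o | urn = i)        (emission probability table)
B = {'U1': {'R': 0.3, 'G': 0.5, 'B': 0.2},
     'U2': {'R': 0.1, 'G': 0.4, 'B': 0.5},
     'U3': {'R': 0.6, 'G': 0.1, 'B': 0.3}}

# pi[i] = P(first urn = i); values assumed from the earlier Viterbi-for-urn slide
pi = {'U1': 0.5, 'U2': 0.3, 'U3': 0.2}
```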

Diagrammatic representation (1/2)

[State diagram: U1, U2, U3 with transition probabilities on the arcs (0.1, 0.4, 0.5 out of U1; 0.6, 0.2, 0.2 out of U2; 0.3, 0.4, 0.3 out of U3), and each state labelled with its emission distribution (U1: R 0.3, G 0.5, B 0.2; U2: R 0.1, G 0.4, B 0.5; U3: R 0.6, G 0.1, B 0.3)]

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (2/2)

[Equivalent arc-emission view: each arc Ui -> Uj is labelled with the joint values P(colour|Ui) x P(Uj|Ui) for R, G and B; e.g. the U3 -> U2 arc carries R 0.24, G 0.04, B 0.12]

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM

1. Given the observation sequence, find the possible state sequence - Viterbi
2. Given the observation sequence, find its probability - forward/backward algorithm
3. Given the observation sequence, find the HMM parameters - Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi

The "start" and "end" are important in a sequence. Subtrees get eliminated due to the Markov Assumption.

POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), $ (end) [start & end states]

Lexicon:
people: N, V
laugh: N, V

Corpora for Training:
^ w11_t11 w12_t12 w13_t13 ...... w1k_1_t1k_1
^ w21_t21 w22_t22 w23_t23 ...... w2k_2_t2k_2
...
^ wn1_tn1 wn2_tn2 wn3_tn3 ...... wnk_n_tnk_n

Inference

[Partial sequence graph: ^ branches to N and V for 'people'; each of those branches to N and V for 'laugh'; the sequence closes with the end state $]

Transition table (this table will change from language to language due to language divergences):

      ^    N    V    O    $
^     0   0.6  0.2  0.2   0
N     0   0.1  0.4  0.3  0.2
V     0   0.3  0.1  0.3  0.3
O     0   0.3  0.2  0.3  0.2
$     1    0    0    0    0

Lexical Probability Table (size = #POS tags in tagset x vocabulary size; vocabulary size = #unique words in the corpus):

      ε      people     laugh      ...
^     1        0           0
N     0     1x10^-3     1x10^-5
V     0     1x10^-6     1x10^-3
O     0        0        1x10^-9
$     1        0           0

Inference on a new sentence: ^ people laugh

p(^ N N | ^ people laugh) = (0.6 x 1.0) x (0.1 x 1x10^-3) x (0.2 x 1x10^-5)

Computational Complexity

If we have to get the probability of each sequence and then find the maximum among them, we would run into an exponential number of computations.

If |s| = #states (tags + ^ + $) and |o| = length of the sentence (#words + ^ + $), then #sequences = |s|^(|o|-2).

But a large number of partial computations can be reused using Dynamic Programming.
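A brute-force sketch of exactly that exponential enumeration, using the reconstructed toy tables above; this is what the dynamic-programming trellis avoids:

```python
from itertools import product

# Toy tables from the Viterbi illustration (reconstructed values).
trans = {'^': {'N': 0.6, 'V': 0.2, 'O': 0.2, '$': 0.0},
         'N': {'N': 0.1, 'V': 0.4, 'O': 0.3, '$': 0.2},
         'V': {'N': 0.3, 'V': 0.1, 'O': 0.3, '$': 0.3},
         'O': {'N': 0.3, 'V': 0.2, 'O': 0.3, '$': 0.2}}
lex = {'N': {'people': 1e-3, 'laugh': 1e-5},
       'V': {'people': 1e-6, 'laugh': 1e-3},
       'O': {'people': 0.0, 'laugh': 1e-9}}

def best_sequence(words):
    # Enumerate all |s|^len(words) tag sequences -- exponential!
    best, best_p = None, 0.0
    for tags in product('NVO', repeat=len(words)):
        p, prev = 1.0, '^'
        for w, t in zip(words, tags):
            p *= trans[prev][t] * lex[t].get(w, 0.0)
            prev = t
        p *= trans[prev]['$']          # close the sequence with the end state
        if p > best_p:
            best, best_p = tags, p
    return best, best_p

print(best_sequence(['people', 'laugh']))   # (('N', 'V'), 7.2e-08) with these toy numbers
```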

Dynamic Programming

[Search tree: ^ expands on ε to N1 (0.6 x 1.0 = 0.6), V2 (0.2), O3 (0.2); each of these is then expanded on 'people', and so on for 'laugh']

Expanding N1 on 'people' (arc value = lexical probability x transition probability):
0.6 x 0.1 x 10^-3 = 6 x 10^-5
0.6 x 0.4 x 10^-3 = 2.4 x 10^-4
0.6 x 0.3 x 10^-3 = 1.8 x 10^-4
0.6 x 0.2 x 10^-3 = 1.2 x 10^-4

No need to expand N4 and N5, because they will never be a part of the winning sequence.

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.

Now complexity reduces from |s|^|o| to |s|·|o|.

Here we followed the Markov assumption of order 1.

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state. Keep advancing sequences that are "maximum" amongst all those ending in the same state.

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm (tree for the sentence "^ People laugh")

[Tree: ^ expands on ε to N (0.6), V (0.2), O (0.2); each is expanded on 'People', giving scores such as 0.06x10^-3, 0.24x10^-3, 0.18x10^-3 from N, 0.06x10^-6, 0.02x10^-6, 0.06x10^-6 from V, and (0), (0), (0) from O]

Claim: we do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass: will a word always be given the same tag? No. Consider the example:

^ people the city with soldiers   (i.e., 'populate')

^ quickly people the city

In the first sentence, "people" is most likely to be tagged as noun, whereas in the second, probability mass will shift and "people" will be tagged as verb, since it occurs after an adverb.

POS 88

Tail phenomenon and Language phenomenon

Long tail phenomenon: the probability is very low but not zero over a large observed sequence.

Language phenomenon: "people", which is predominantly tagged as "Noun", displays a long tail phenomenon; "laugh" is predominantly tagged as "Verb".

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

[Two nodes N1 (6x10^-5) and N2 (6x10^-8), both about to read LAUGH]

Next step: all the probabilities will be multiplied by identical probabilities (lexical and transition), so the children of N2 will have probability less than the children of N1.

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean?

P(A|B) = P(B|A), if P(A) = P(B)

P(A|B) can mean:
Causality: B causes A
Sequentiality: A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = {U1, U2, U3}, V = {R, G, B}
For observation O = o1 ... on and state sequence Q = q1 ... qn,
π_i = P(q1 = U_i)

A =      U1   U2   U3
    U1  0.1  0.4  0.5
    U2  0.6  0.2  0.2
    U3  0.3  0.4  0.3

B =       R    G    B
    U1  0.3  0.5  0.2
    U2  0.1  0.4  0.5
    U3  0.6  0.1  0.3

92

Observations and states

        O1  O2  O3  O4  O5  O6  O7  O8
OBS:    R   R   G   G   B   R   G   R
State:  S1  S2  S3  S4  S5  S6  S7  S8

Si ∈ {U1, U2, U3}: a particular state
S: state sequence; O: observation sequence
S* = "best" possible state (urn) sequence
Goal: maximize P(S|O) by choosing the "best" S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O), where S is the state sequence and O is the observation sequence.

S* = argmax_S P(S|O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

        O1  O2  O3  O4  O5  O6  O7  O8
OBS:    R   R   G   G   B   R   G   R
State:  S1  S2  S3  S4  S5  S6  S7  S8

P(S|O) = P(S1-8 | O1-8)
       = P(S1|O) P(S2|S1, O) P(S3|S2, S1, O) ... P(S8|S7 ... S1, O)

By Markov Assumption (a state depends only on the previous state):

P(S|O) = P(S1|O) P(S2|S1, O) P(S3|S2, O) ... P(S8|S7, O)

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) = P(A) P(B|A) / P(B)

P(A): Prior
P(B|A): Likelihood

argmax_S P(S|O) = argmax_S P(S) P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

P(S) = P(S1-8)
     = P(S1) P(S2|S1) P(S3|S2, S1) P(S4|S3, S2, S1) ... P(S8|S7 ... S1)

By Markov Assumption (k = 1):

P(S) = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) ... P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

P(O|S) = P(O1|S1-8) P(O2|O1, S1-8) P(O3|O1-2, S1-8) ... P(O8|O1-7, S1-8)

Assumption: the ball drawn depends only on the Urn chosen, so

P(O|S) = P(O1|S1) P(O2|S2) P(O3|S3) ... P(O8|S8)

P(S) P(O|S) = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) ... P(S8|S7) P(O1|S1) P(O2|S2) P(O3|S3) ... P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Multilinguality Indian situation Major streams

Indo European Dravidian Sino Tibetan Austro-Asiatic

Some languages are ranked within 20 in the world in terms of the populations speaking them Hindi and Urdu 5th (~500

milion) Bangla 7th (~300 million) Marathi 14th (~70 million)

NLP architecture and stages of processing- ambiguity at every stage

Phonetics and phonology Morphology Lexical Analysis Syntactic Analysis Semantic Analysis Pragmatics Discourse

21 July 2014Pushpak Bhattacharyya Intro

POS 15

Phonetics processing of speech sound and associated challenges

Homophones bank (finance) vs bank (river bank) Near Homophones maatraa vs maatra (hin) Word Boundary

आजायग (aajaayenge) (aa jaayenge (will come) or aaj aayenge(will come today)

I got [ua]plate His research is in human languages

Disfluency ah um ahem etc

(near homophone trouble) The king of Abu Dhabi expired and there was national mourning for 7 days Some children were playing in the evening when a person chided them Do not play it is mourning time The children said No it is evening time and we will play

21 July 2014Pushpak Bhattacharyya Intro

POS 16

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 17

Morphology Word formation rules from root words Nouns Plural (boy-boys) Gender marking (czar-czarina) Verbs Tense (stretch-stretched) Aspect (eg perfective sit-had

sat) Modality (eg request khaanaa khaaiie) First crucial first step in NLP Languages rich in morphology eg Dravidian Hungarian

Turkish Languages poor in morphology Chinese English Languages with rich morphology have the advantage of easier

processing at higher stages of processing A task of interest to computer science Finite State Machines for

Word Morphology

21 July 2014Pushpak Bhattacharyya Intro

POS 18

Lexical Analysis

Dictionary and word properties

dognoun (lexical property)take-rsquosrsquo-in-plural (morph property)animate (semantic property)4-legged (-do-)carnivore (-do)

21 July 2014Pushpak Bhattacharyya Intro

POS 19

Lexical Disambiguationpart of Speech Disambiguation

Dog as a noun (animal) Dog as a verb (to pursue)

Sense Disambiguation Dog (as animal) Dog (as a very detestable person) The chair emphasised the need for adult education

Very common in day to day communicationsSatellite Channel Ad Watch what you want when you

want (two senses of watch)Ground breaking ceremonyresearch(ToI 14114) India eradicates polio says WHO

21 July 2014Pushpak Bhattacharyya Intro

POS 20

Technological developments bring in new terms additional meaningsnuances for existing terms

Justify as in justify the right margin (word processing context)

Xeroxed a new verb Digital Trace a new expression Communifaking pretending to talk on

mobile when you are actually not Discomgooglation anxietydiscomfort at

not being able to access internet Helicopter Parenting over parenting Obamagain Obama care modinomics

21 July 2014Pushpak Bhattacharyya Intro

POS 21

Ambiguity of Multiwords The grandfather kicked the bucket after suffering from cancer This job is a piece of cake Put the sweater on He is the dark horse of the match

Google Translations of above sentencesदादा कसर स पी ड़त होन क बाद बा ट लात मार इस काम क कक का एक टकड़ा हवटर पर रखोवह मच क अधर घोड़ा ह

21 July 2014Pushpak Bhattacharyya Intro

POS 22

Ambiguity of Named Entities

Bengali চ ল সরকার বািড়েত আেছEnglish Government is restless at home ()Chanchal Sarkar is at home

Amsterdam airport ldquoBaby Changing Roomrdquo

Hindi द नक दबग द नयाEnglish everyday bold worldActually name of a Hindi newspaper in Indore

High degree of overlap between NEs and MWEs

Treat differently - transliterate do not translate

21 July 2014Pushpak Bhattacharyya Intro

POS 23

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 24

StructureS

NPVP

V NP

Ilike mangoes

21 July 2014Pushpak Bhattacharyya Intro

POS25

Structural Ambiguity Scope

1The old men and women were taken to safe locations(old men and women) vs ((old men) and women)2 No smoking areas will allow Hookas inside

Preposition Phrase Attachment I saw the boy with a telescope

(who has the telescope) I saw the mountain with a telescope

(world knowledge mountain cannot be an instrument of seeing)

Very ubiquitous newspaper headline ldquo20 years later BMC pays father 20 lakhs for causing sonrsquos deathrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 26

Garden pathing

The only minus possibly was the need to face the audience more and more insightful question answer

The old man the boat

The horse raced past the garden fell

21 July 2014Pushpak Bhattacharyya Intro

POS 27

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 28

Semantic Analysis Representation in terms of

Predicate calculusSemantic NetsFramesConceptual Dependencies and Scripts

John gave a book to Mary Give action Agent John Object Book Recipient

Mary Challenge ambiguity in semantic role labeling

(Eng) Visiting aunts can be a nuisance (Hin) aapko mujhe mithaai khilaanii padegii

(ambiguous in Marathi and Bengali too not in Dravidian languages)

21 July 2014Pushpak Bhattacharyya Intro

POS 29

Coreference challenge

Binding of co-referring nouns and pronouns The monkey ate the banana because it

was hungry The monkey ate the banana because it

was ripe and sweet The monkey ate the banana because it

was lunch time

21 July 2014Pushpak Bhattacharyya Intro

POS 30

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 31

Pragmatics Very hard problem Model user intention

Tourist (in a hurry checking out of the hotel motioning to the service boy) Boy go upstairs and see if my sandals are under the divan Do not be late I just have 15 minutes to catch the train

Boy (running upstairs and coming back panting) yes sir they are there

World knowledge WHY INDIA NEEDS A SECOND OCTOBER (ToI

21007)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

DiscourseProcessing of sequence of sentences Mother to John

John go to school It is open today Should you bunk Father will be very angry

Ambiguity of openbunk whatWhy will the father be angry

Complex chain of reasoning and application of world knowledge Ambiguity of father

father as parent or

father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected ndash today was the math test

He couldnrsquot control the class

Teacher shouldnrsquot have made him responsible

After all he is just a janitor

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (12)1 Teacher (angrily) did you miss the class yesterday

Student not much

2 A man coming back to his parked car sees the sticker Parking fine He goes and thanks the policeman for appreciating his parking skill

3 John I got a Jaguar car for my unemployed youngest sonJack Thats a great exchange

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (22)A teacher-student exchangeTeacher What do you think is the capital of Ethiopia

Student What do you think

Teacher (angrily) I do not think ltpausegt I know

Student I do not think I know ltno pausegt21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps Segmentation (Word Boundary Detection) Identify the word

Isolated Word Recognition Identify W given SS (speech signal)

^arg max ( | )

WW P W SS

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood called ldquophonological model ldquo intuitively more tractable

P(W) = prior probability called ldquolanguage modelrdquo

^arg m ax ( | )

arg m ax ( ) ( | )W

W

W P W SS

P W P SS W

W appears in the corpus( ) words in the corpus

P W

21 July 2014Pushpak Bhattacharyya Intro

POS 39

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound Text ambiguity whether vs weather right vs write bought vs bot

Text Sound ambiguity read (present tense) vs read (past tense) lead (verb) vs lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W)= P(t o m ae t o |Word is ldquotomatordquo) = Product of arc probabilities

t

s4

o m o

ae

t

aa

end

s1 s2 s3s5

s6 s7

10 10 10 1010

10

073

027

Word

Pronunciation Automaton

Tomato

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs Discrimnative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

bull Entity A and Entity BndashMatch(AB)

ndashTwo entities match iff their parts matchbull Match(Parts(A) Parts(B))

ndashTwo entities match iff their properties matchbull Match(Properties(A) Properties(B))

bull Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding James Allan Speech and NLP Jurafsky and Martin Foundations of Statistical NLP Manning and Schutze

Other References Statistical NLP Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied DisciplinesPhilosophy Semantics Meaning of ldquomeaningrdquo Logic

(syllogism)Linguistics Study of Syntax Lexicon Lexical Semantics etc

Probability and Statistics Corpus Linguistics Testing of Hypotheses System Evaluation

Cognitive Science Computational Models of Language Processing Language Acquisition

Psychology Behavioristic insights into Language Processing Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory Entropy Random Fields

Computer Sc amp Engg Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

infection declension derivation etc ) Rule based Vs Statistical NLU comparison with POS tagging as case study Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summary

• Both Linguistics and Computation needed

• Linguistics is the eye, Computation the body

• Phenomenon → Formalization → Technique → Experimentation → Evaluation → Hypothesis Testing

• has accorded to NLP the prestige it commands today

• Natural Science like approach

• Neither Theory Building nor Data Driven Pattern finding can be ignored
21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

(Figure: the NLP Trinity – three axes: Problem (Morph Analysis, Part of Speech Tagging, Parsing, Semantics), Algorithm (HMM, MEMM, CRF), and Language (Hindi, Marathi, English, French).)

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

"_" The_DT mechanisms_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, "_" said_VBD Mr._NNP Benton_NNP ._.

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_J batsman_N Aravinda_De_Silva_N is_F a_F man_N of_F few_J words_N was_F very_R much_R evident_J on_F Wednesday_N when_F the_F legendary_J batsman_N ,_F who_F has_V always_R let_V his_N bat_N talk_V ,_F struggled_V to_F answer_V a_F barrage_N of_F questions_N at_F a_F function_N to_F promote_V the_F cricket_N league_N in_F the_F city_N ._F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation: That_F/N/J ('that' can be a complementizer (can be put under 'F'), a demonstrative (can be put under 'J') or a pronoun (can be put under 'N')); former_J; Sri_N/J Lanka_N/J (Sri Lanka together qualify the skipper); skipper_N/V ('skipper' can be a verb too); and_F; ace_J/N ('ace' can be both J and N: "Nadal served an ace"); batsman_N/J ('batsman' can be J as it qualifies Aravinda De Silva); Aravinda_N De_N Silva_N; is_F; a_F; man_N/V ('man' can be a verb too, as in 'man the boat'); of_F; few_J; words_N/V ('words' can be a verb too, as in 'he words his speeches beautifully')

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of "That"

That man is known by the company he keeps. (Demonstrative)

Man that is known by the company he keeps gets a good job. (Pronoun)

That man is known by the company he keeps is a proverb. (Complementation)

Chaotic systems: systems where a small perturbation in input causes a large change in output.

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation: was_F; very_R; much_R; evident_J; on_F; Wednesday_N; when_F/N ('when' can be a relative pronoun (put under 'N') as in 'I know the time when he comes'); the_F; legendary_J; batsman_N; who_F/N; has_V; always_R; let_V; his_N; bat_N/V; talk_V/N; struggle_V/N; answer_V/N; barrage_N/V; question_N/V; function_N/V; promote_V; cricket_N; league_N; city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (1/2)

Best tag sequence = T* = argmax P(T|W) = argmax P(T) P(W|T)   (by Bayes' Theorem)

P(T) = P(t0=^, t1, t2, …, tn+1(= end tag))
     = P(t0) P(t1|t0) P(t2|t1,t0) P(t3|t2,t1,t0) … P(tn|tn-1,tn-2,…,t0) P(tn+1|tn,tn-1,…,t0)
     = P(t0) P(t1|t0) P(t2|t1) … P(tn|tn-1) P(tn+1|tn)
     = ∏(i=0 to n+1) P(ti|ti-1)   (Bigram Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 65
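The bigram transition probabilities above are typically estimated by counting over a tagged corpus. A hedged sketch (toy corpus assumed, not the course's code):

from collections import defaultdict

tagged_corpus = [
    [("^", "^"), ("people", "N"), ("laugh", "V"), (".", ".")],
    [("^", "^"), ("people", "N"), ("jump", "V"), ("high", "R"), (".", ".")],
]
bigram = defaultdict(int)    # counts of (t_{i-1}, t_i)
unigram = defaultdict(int)   # counts of t_{i-1}
for sent in tagged_corpus:
    tags = [t for _, t in sent]
    for prev, curr in zip(tags, tags[1:]):
        bigram[(prev, curr)] += 1
        unigram[prev] += 1

def p_trans(curr, prev):
    # maximum-likelihood estimate of P(t_i = curr | t_{i-1} = prev)
    return bigram[(prev, curr)] / unigram[prev] if unigram[prev] else 0.0

print(p_trans("N", "^"))   # 1.0 on this toy corpus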

Argmax computation (22)

Argmax computation (2/2)

P(W|T) = P(w0|t0…tn+1) P(w1|w0,t0…tn+1) P(w2|w1,w0,t0…tn+1) … P(wn|w0…wn-1,t0…tn+1) P(wn+1|w0…wn,t0…tn+1)

Assumption: A word is determined completely by its tag. This is inspired by speech recognition.

= P(w0|t0) P(w1|t1) … P(wn+1|tn+1)

= ∏(i=0 to n+1) P(wi|ti)

= ∏(i=1 to n+1) P(wi|ti)   (Lexical Probability Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R ._.

(Figure: a chain of tag states over the sentence, with alternative tags N/V shown for each word; Bigram Probabilities label the arcs between tag states and Lexical Probabilities label the word emissions.)

This model is called the Generative model. Here words are observed from tags as states. This is similar to HMM.

21 July 2014Pushpak Bhattacharyya Intro

POS 67
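Putting the two assumptions together, the generative score of a tag sequence is the product of bigram and lexical probabilities. A small sketch with made-up probability values (illustration only, not the course's code):

trans = {("^", "N"): 0.6, ("N", "V"): 0.4, ("V", "."): 0.3}              # assumed P(t_i | t_{i-1})
lex   = {("people", "N"): 1e-3, ("laugh", "V"): 1e-3, (".", "."): 1.0}   # assumed P(w_i | t_i)

def generative_score(words, tags):
    # P(T) * P(W|T) under the bigram and lexical probability assumptions
    p, prev = 1.0, "^"
    for w, t in zip(words, tags):
        p *= trans.get((prev, t), 0.0) * lex.get((w, t), 0.0)
        prev = t
    return p

print(generative_score(["people", "laugh", "."], ["N", "V", "."]))   # 7.2e-08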

Typical POS tagging steps

• Implementation of Viterbi – Unigram, Bigram

• Five Fold Evaluation

• Per POS Accuracy

• Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68
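The last two steps can be read as the following sketch (toy gold/predicted tags assumed; not the actual evaluation script):

from collections import Counter, defaultdict

gold = ["N", "V", "N", "N", "J", "V"]
pred = ["N", "V", "N", "J", "J", "N"]
confusion = defaultdict(Counter)          # confusion[gold_tag][predicted_tag] -> count
for g, p in zip(gold, pred):
    confusion[g][p] += 1
for tag, row in sorted(confusion.items()):
    per_pos_accuracy = row[tag] / sum(row.values())
    print(tag, per_pos_accuracy, dict(row))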

(Figure: bar chart "Per POS Accuracy for Bigram Assumption", y-axis 0 to 1.2, one bar per BNC tag such as AJ0, AJ0-NN1, AJ0-VVG, AJC, AT0, AV0-AJ0, AVP-PRP, AVQ-CJS, CJS, CJS-PRP, CJT-DT0, CRD-PNI, DT0, DTQ, ITJ, NN1, NN1-NP0, NN1-VVG, NN2-VVZ, NP0-NN1, PNI, PNP, PNX, PRP, PRP-CJS, TO0, VBB, VBG, VBN, VDB, VDG, VDN, VHB, VHG, VHN, VM0, VVB-NN1, VVD-AJ0, VVG, VVG-NN1, VVN, VVN-VVD, VVZ-NN2.)

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

Columns (predicted): AJ0, AJ0-AV0, AJ0-NN1, AJ0-VVD, AJ0-VVG, AJ0-VVN, AJC, AJS, AT0, AV0, AV0-AJ0, AVP

AJ0:      2899   20   32    1    3    3    0    0    18    35   27    1
AJ0-AV0:    31   18    2    0    0    0    0    0     0     1   15    0
AJ0-NN1:   161    0  116    0    0    0    0    0     0     0    1    0
AJ0-VVD:     7    0    0    0    0    0    0    0     0     0    0    0
AJ0-VVG:     8    0    0    0    2    0    0    0     1     0    0    0
AJ0-VVN:     8    0    0    3    0    2    0    0     1     0    0    0
AJC:         2    0    0    0    0    0   69    0     0    11    0    0
AJS:         6    0    0    0    0    0    0   38     0     2    0    0
AT0:       192    0    0    0    0    0    0    0  7000    13    0    0
AV0:       120    8    2    0    0    0   15    2    24  2444   29   11
AV0-AJ0:    10    7    0    0    0    0    0    0     0    16   33    0
AVP:        24    0    0    0    0    0    0    0     1    11    0  737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

(Figure: the NLP Trinity again, with HMM highlighted on the Algorithm axis – Problem: Morph Analysis, Part of Speech Tagging, Parsing, Semantics; Algorithm: HMM, MEMM, CRF; Language: Hindi, Marathi, English, French.)

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example: Colored Ball choosing

Urn 1: # Red = 30, # Green = 50, # Blue = 20
Urn 2: # Red = 10, # Green = 40, # Blue = 50
Urn 3: # Red = 60, # Green = 10, # Blue = 30

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd.)

Given:

Transition probability table (row = current urn, column = next urn):
      U1   U2   U3
U1   0.1  0.4  0.5
U2   0.6  0.2  0.2
U3   0.3  0.4  0.3

Emission probability table (row = urn, column = ball colour):
      R    G    B
U1   0.3  0.5  0.2
U2   0.1  0.4  0.5
U3   0.6  0.1  0.3

Observation: RRGGBRGR

State Sequence: ?? Not so easily computable.

21 July 2014Pushpak Bhattacharyya Intro

POS 73

(Above: transition probability table and emission probability table.)

Diagrammatic representation (1/2)

(Figure: the three urns U1, U2, U3 drawn as states with transition arcs – U1→U1 0.1, U1→U2 0.4, U1→U3 0.5, U2→U1 0.6, U2→U2 0.2, U2→U3 0.2, U3→U1 0.3, U3→U2 0.4, U3→U3 0.3 – and emission labels on each state: U1: R 0.3, G 0.5, B 0.2; U2: R 0.1, G 0.4, B 0.5; U3: R 0.6, G 0.1, B 0.3.)

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (2/2)

(Figure: the same three-urn HMM with each transition arc annotated by the joint probability of taking the arc and emitting each colour, i.e. transition probability × emission probability, e.g. R 0.02, G 0.08, B 0.10; R 0.24, G 0.04, B 0.12; R 0.06, G 0.24, B 0.30; R 0.08, G 0.20, B 0.12; R 0.15, G 0.25, B 0.10; R 0.18, G 0.03, B 0.09; R 0.03, G 0.05, B 0.02.)

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM

1. Given the observation sequence, find the possible state sequences – Viterbi
2. Given the observation sequence, find its probability – forward/backward algorithm
3. Given the observation sequence, find the HMM parameters – Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi

The "start" and "end" are important in a sequence. Subtrees get eliminated due to the Markov Assumption.

POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start) and the end marker [start & end states]

Lexicon:
people: N, V
laugh: N, V

Corpora for Training:
^ w11_t11 w12_t12 w13_t13 …… w1k_1_t1k_1
^ w21_t21 w22_t22 w23_t23 …… w2k_2_t2k_2
^ wn1_tn1 wn2_tn2 wn3_tn3 …… wnk_n_tnk_n

Inference (partial sequence graph): ^ → {N, V} → {N, V} → end

Transition probability table (row = previous tag, column = next tag; last column = end marker):
       ^    N    V    O    end
^      0   0.6  0.2  0.2   0
N      0   0.1  0.4  0.3   0.2
V      0   0.3  0.1  0.3   0.3
O      0   0.3  0.2  0.3   0.2
end    1    0    0    0    0

This transition table will change from language to language due to language divergences.

Lexical Probability Table (size = #POS tags in tagset × vocabulary size, where vocabulary size = #unique words in the corpus):
       Є        people    laugh     …
^      1         0          0       0
N      0      1×10^-3    1×10^-5
V      0      1×10^-6    1×10^-3
O      0         0       1×10^-9
end    1         0          0       0

Inference for a new sentence: ^ people laugh (end)

p(^ N N end | ^ people laugh end) = (0.6 × 0.1) × (0.1 × 1×10^-3) × (0.2 × 1×10^-5)

(Figure: partial trellis ^ → N/V → N/V → end, with Є arcs at the sentence boundaries.)
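A sketch of the same kind of computation in code, using the toy tables above; '$' stands in for the end state whose symbol was lost in the extraction, and the product is grouped per step as P(tag | previous tag) × P(word | tag), so the grouping differs slightly from the slide's display:

trans = {("^", "N"): 0.6, ("N", "N"): 0.1, ("N", "V"): 0.4, ("N", "$"): 0.2}
lex   = {("people", "N"): 1e-3, ("laugh", "N"): 1e-5,
         ("people", "V"): 1e-6, ("laugh", "V"): 1e-3}

p = trans[("^", "N")] * lex[("people", "N")] \
  * trans[("N", "N")] * lex[("laugh", "N")] \
  * trans[("N", "$")]
print(p)   # 1.2e-10 with these toy numbers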

Computational Complexity

If we have to get the probability of each sequence and then find the maximum among them, we would run into an exponential number of computations.

If |s| = #states (tags + ^ + end) and |o| = length of sentence (#words + ^ + end), then #sequences = |s|^(|o|-2).

But a large number of partial computations can be reused using Dynamic Programming.

Dynamic Programming

(Figure: trellis for "^ people laugh": the start node ^ expands on Є into N1, V2, O3; each surviving node then expands on "people" and "laugh".)

From ^ on Є: 0.6 × 1.0 = 0.6 into N1, 0.2 into V2, 0.2 into O3.

Expanding N1 on "people" (one product per outgoing transition from N):
0.6 × 0.1 × 10^-3 = 6 × 10^-5
0.6 × 0.4 × 10^-3 = 2.4 × 10^-4
0.6 × 0.3 × 10^-3 = 1.8 × 10^-4
0.6 × 0.2 × 10^-3 = 1.2 × 10^-4

No need to expand N4 and N5, because they will never be a part of the winning sequence.

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.

Now complexity reduces from |s|^|o| to |s|·|o|.

Here we followed the Markov assumption of order 1.

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state. Keep advancing sequences that are "maximum" amongst all those ending in the same state.

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm

(Figure: tree for the sentence "^ People laugh (end)". From ^ on Ԑ the branches N, V, O get scores (0.6), (0.2), (0.2); expanding on "People", the surviving nodes carry scores such as 0.06×10^-3, 0.24×10^-3, 0.18×10^-3 and 0.06×10^-6, 0.02×10^-6, 0.06×10^-6, with the O branches at (0).)

Claim: We do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass

Will a word always be given the same tag? No. Consider the examples:

^ people the city with soldiers (i.e. 'populate')
^ quickly people the city

In the first sentence "people" is most likely to be tagged as noun, whereas in the second the probability mass will shift and "people" will be tagged as verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon

Long tail Phenomenon: probability is very low but not zero over a large observed sequence.

Language Phenomenon: "people", which is predominantly tagged as "Noun", displays a long tail phenomenon; "laugh" is predominantly tagged as "Verb".

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

(Figure: two nodes N1 (6×10^-5) and N2 (6×10^-8), each expanding to N, V, O on "LAUGH".)

Next step: all the probabilities will be multiplied by identical probabilities (lexical and transition). So the children of N2 will have probability less than the children of N1.

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean?

P(A|B) = P(B|A) if P(A) = P(B)

P(A|B) can mean: Causality: B causes A; Sequentiality: A follows B.

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here: S = {U1, U2, U3}; V = {R, G, B}

For observation O = o1 … on and state sequence Q = q1 … qn:

π_i = P(q1 = U_i)

A = transition probability table:
      U1   U2   U3
U1   0.1  0.4  0.5
U2   0.6  0.2  0.2
U3   0.3  0.4  0.3

B = emission probability table:
      R    G    B
U1   0.3  0.5  0.2
U2   0.1  0.4  0.5
U3   0.6  0.1  0.3

92
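For later experiments it is convenient to hold the same parameters as arrays; a convenience sketch (numpy assumed; the initial distribution is taken as uniform here, since the slide only defines π_i = P(q1 = U_i) without giving values):

import numpy as np

states, symbols = ["U1", "U2", "U3"], ["R", "G", "B"]
A = np.array([[0.1, 0.4, 0.5],
              [0.6, 0.2, 0.2],
              [0.3, 0.4, 0.3]])       # transition probabilities, rows = from-urn
B = np.array([[0.3, 0.5, 0.2],
              [0.1, 0.4, 0.5],
              [0.6, 0.1, 0.3]])       # emission probabilities, rows = urn, cols = R, G, B
pi = np.array([1/3, 1/3, 1/3])        # assumed uniform start distribution
assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)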

Observations and states

       O1  O2  O3  O4  O5  O6  O7  O8
OBS:    R   R   G   G   B   R   G   R
State: S1  S2  S3  S4  S5  S6  S7  S8

Si = U1, U2 or U3; a particular state
S = State sequence
O = Observation sequence
S* = "best" possible state (urn) sequence
Goal: Maximize P(S*|O) by choosing the "best" S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O), where S is the State Sequence and O is the Observation Sequence.

S* = argmax_S P(S|O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

P(S|O) = P(S1-8 | O1-8)
       = P(S1|O) P(S2|S1,O) P(S3|S2,S1,O) … P(S8|S7…S1,O)

By Markov Assumption (a state depends only on the previous state):

P(S|O) = P(S1|O) P(S2|S1,O) P(S3|S2,O) … P(S8|S7,O)

       O1  O2  O3  O4  O5  O6  O7  O8
OBS:    R   R   G   G   B   R   G   R
State: S1  S2  S3  S4  S5  S6  S7  S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) = P(A) P(B|A) / P(B)

P(A): Prior
P(B|A): Likelihood

argmax_S P(S|O) = argmax_S P(S) P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

P(S) = P(S1-8)
     = P(S1) P(S2|S1) P(S3|S2,S1) P(S4|S3,S2,S1) … P(S8|S7…S1)

By Markov Assumption (k=1):

P(S) = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) … P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

P(O|S) = P(O1|S1-8) P(O2|O1,S1-8) P(O3|O2,O1,S1-8) … P(O8|O1-7,S1-8)

Assumption: the ball drawn depends only on the Urn chosen

P(O|S) = P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)

P(S|O) ∝ P(S) P(O|S)
       = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) … P(S8|S7) × P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S) P(O|S) = [P(O0|S0) P(S1|S0)] [P(O1|S1) P(S2|S1)] [P(O2|S2) P(S3|S2)] [P(O3|S3) P(S4|S3)] [P(O4|S4) P(S5|S4)] [P(O5|S5) P(S6|S5)] [P(O6|S6) P(S7|S6)] [P(O7|S7) P(S8|S7)] [P(O8|S8) P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e. P(S9|S8) = 1.
O0 is the ε-transition.

       O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:    ε   R   R   G   G   B   R   G   R
State: S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

(Figure: chain S0 → S1 → S2 → … → S8 → S9 with the arcs labelled by the observations ε, R, R, G, G, B, R, G, R.)

       O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:    ε   R   R   G   G   B   R   G   R
State: S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

Notation: P(Ok|Sk) P(Sk+1|Sk) = P(Sk → Sk+1) with the arc labelled Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(Figure: two states S1 and S2 with arcs labelled (a1:0.3), (a2:0.4), (a1:0.2), (a2:0.3), (a1:0.1), (a2:0.2), (a1:0.3), (a2:0.2).)

The question here is: "what is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree

(Figure: tree rooted at Start, branching to S1 and S2 on each of the symbols €, a1, a2.)

€:  S1 = 1.0, S2 = 0.0
a1: 1.0×0.1 = 0.1, 0.3, 0.0, 0.0   (arc probabilities 0.1, 0.3, 0.2, 0.3)
a2: 0.1×0.2 = 0.02, 0.1×0.4 = 0.04, 0.3×0.3 = 0.09, 0.3×0.2 = 0.06   (arc probabilities 0.2, 0.4, 0.3, 0.2)

Choose the winning sequence per state, per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd…

(Figure: the tree grown over the next two symbols a1, a2, starting from the surviving scores 0.09 (S1) and 0.06 (S2).)

a1 (arc probabilities 0.1, 0.3, 0.2, 0.3): 0.09×0.1 = 0.009, 0.027, 0.012, 0.018
a2 (from the winners): 0.027×0.3 = 0.0081 (S1), 0.027×0.2 = 0.0054 (S2), 0.012×0.4 = 0.0048 (S2), 0.012×0.2 = 0.0024 (S1)

The problem being addressed by this tree is: S* = argmax_S P(S | a1-a2-a1-a2, µ),
where a1-a2-a1-a2 is the output sequence and µ the model or the machine.

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

(Figure: the recovered path through the states S1, S2, S1, S2, S1, connected by the outputs a1, a2, a1, a2, obtained by working backward from the final winner.)

Problem statement: Find the best possible sequence
S* = argmax_S P(S | O, µ)
where S = State Sequence, O = Output Sequence, µ = Machine or Model.

Machine or Model = (S, S0, A, T): State collection, Start symbol, Alphabet set, Transitions.

T is defined as P(Si –a_k→ Sj).

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

Latest symbol observed →
Ending state   €      a1                                a2              a1                a2
S1            1.0    (1.0×0.1, 0.0×0.2) = (0.1, 0.0)   (0.02, 0.09)    (0.009, 0.012)    (0.0024, 0.0081)
S2            0.0    (1.0×0.3, 0.0×0.3) = (0.3, 0.0)   (0.04, 0.06)    (0.027, 0.018)    (0.0048, 0.0054)

Note: Every cell records the winning probability ending in that state. The bold-faced value in each cell shows the sequence probability ending in that state. Going backward from the final winner sequence, which ends in state S2 (indicated by the 2nd tuple), we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm (following James Allen, Natural Language Understanding (2nd edition), Benjamin/Cummings (pub.), 1995)

Given:
1. The HMM, which means:
   a. Start State: S1
   b. Alphabet: A = {a1, a2, …, ap}
   c. Set of States: S = {S1, S2, …, Sn}
   d. Transition probability P(Si –a_k→ Sj), which is equal to P(Sj, a_k | Si)
2. The output string a1 a2 … aT

To find: The most likely sequence of states C1 C2 … CT which produces the given output sequence, i.e.,
C1 C2 … CT = argmax_C [P(C | a1 a2 … aT)]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contd…

Data Structures:
1. An N×T array called SEQSCORE to maintain the winner sequence always (N = #states, T = length of o/p sequence)
2. Another N×T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For (i = 2 to N) do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For (t = 2 to T) do
    For (i = 1 to N) do
        SEQSCORE(i,t) = Max(j=1,N) [SEQSCORE(j,(t-1)) * P(Sj –a_k→ Si)]
        BACKPTR(i,t) = index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = i that maximizes SEQSCORE(i,T)
For i from (T-1) to 1 do
    C(i) = BACKPTR[C(i+1),(i+1)]

Optimizations possible:
1. BACKPTR can be 1×T
2. SEQSCORE can be T×2

Homework: Compare this with A* and Beam Search. Reason for this comparison: both of them work for finding and recovering sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 109
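A runnable sketch of the SEQSCORE/BACKPTR procedure (0-based indices instead of the slides' 1-based ones; a start distribution pi is used instead of the fixed start state, and emission is attached to the destination state rather than to the arc, a common equivalent formulation rather than exactly the one above):

import numpy as np

def viterbi(obs, A, B, pi):
    # obs: list of symbol indices; A[i, j] = P(Sj | Si); B[i, k] = P(a_k | Si)
    N, T = A.shape[0], len(obs)
    seqscore = np.zeros((N, T))
    backptr = np.zeros((N, T), dtype=int)
    seqscore[:, 0] = pi * B[:, obs[0]]                     # initialization
    for t in range(1, T):                                  # iteration
        for i in range(N):
            cand = seqscore[:, t - 1] * A[:, i] * B[i, obs[t]]
            backptr[i, t] = int(np.argmax(cand))
            seqscore[i, t] = cand[backptr[i, t]]
    path = [int(np.argmax(seqscore[:, T - 1]))]            # sequence identification
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[path[-1], t]))
    return list(reversed(path)), float(seqscore[:, T - 1].max())

# Urn example, observation R R G (symbols: 0=R, 1=G, 2=B)
A = np.array([[0.1, 0.4, 0.5], [0.6, 0.2, 0.2], [0.3, 0.4, 0.3]])
B = np.array([[0.3, 0.5, 0.2], [0.1, 0.4, 0.5], [0.6, 0.1, 0.3]])
pi = np.array([1/3, 1/3, 1/3])
print(viterbi([0, 0, 1], A, B, pi))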

Viterbi Algorithm for the Urn problem (first two symbols)

(Figure: trellis rooted at S0, expanding on ε to U1, U2, U3 with scores 0.5, 0.3, 0.2, and then on R to U1, U2, U3 under each of them, with accumulated path scores such as 0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036 at the frontier.)

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of order > 1 (say 2)

The same theory works:
P(S) P(O|S) = P(O0|S0) P(S1|S0) [P(O1|S1) P(S2|S1,S0)] [P(O2|S2) P(S3|S2,S1)] [P(O3|S3) P(S4|S3,S2)] [P(O4|S4) P(S5|S4,S3)] [P(O5|S5) P(S6|S5,S4)] [P(O6|S6) P(S7|S6,S5)] [P(O7|S7) P(S8|S7,S6)] [P(O8|S8) P(S9|S8,S7)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e. P(S9|S8,S7) = 1.
O0 is the ε-transition.

       O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:    ε   R   R   G   G   B   R   G   R
State: S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111
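A one-step sketch of what changes under the order-2 assumption: the transition lookup is indexed by the previous two states instead of one (toy dictionaries with assumed values):

trans2 = {("^", "^", "N"): 0.7}                  # assumed P(S_{k+1} | S_k, S_{k-1})
emit = {("^", "eps"): 1.0, ("N", "People"): 0.8} # assumed P(o_k | S_k)

def step(score, prev2, prev1, nxt, obs):
    # one trellis step under a second-order Markov assumption
    return score * emit.get((prev1, obs), 0.0) * trans2.get((prev2, prev1, nxt), 0.0)

print(step(1.0, "^", "^", "N", "eps"))           # 0.7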

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence? The Language modeling problem:

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   • Less probable because of grammatical mistake
3. P("The svn rises in the east")
   • Less probable because of lexical mistake
4. P("The sun rises in the west")
   • Less probable because of semantic mistake

Probabilities computed in the context of corpora.

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model

1. Detect well-formedness
   • Lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   • Given a piece of text, what language does it belong to?
     Good morning – English
     Guten morgen – German
     Bon jour – French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0 o1 o2 o3 … om)?

P(O) = Σ_S P(O, S)   (Marginalization)

Consider the observation sequence O = o0 o1 o2 … om and the state sequences S = S0 S1 S2 S3 … Sm Sm+1, where the Si's represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0 o1 o2 o3 … om)

P(O) = Σ_S P(O, S) = Σ_S P(S) P(O|S)

P(S) P(O|S) = P(S0 S1 … Sm+1) P(o0 o1 … om | S0 S1 … Sm+1)
            = P(S0) P(S1|S0) P(S2|S1) … P(Sm+1|Sm) × P(o0|S0) P(o1|S1) … P(om|Sm)
            = P(S0) [P(o0|S0) P(S1|S0)] [P(o1|S1) P(S2|S1)] … [P(om|Sm) P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k,i)

Define F(k,i) = Probability of being in state Si having seen o0 o1 o2 … ok
F(k,i) = P(o0 o1 o2 … ok, Si)

With m as the length of the observed sequence and N states:
P(observed sequence) = P(o0 o1 o2 … om) = Σ(p=0 to N) P(o0 o1 o2 … om, Sp) = Σ(p=0 to N) F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd.)

F(k, q) = P(o0 o1 o2 … ok, Sq)
        = P(o0 o1 o2 … ok-1, ok, Sq)
        = Σ(p=0 to N) P(o0 o1 o2 … ok-1, Sp, ok, Sq)
        = Σ(p=0 to N) P(o0 o1 o2 … ok-1, Sp) P(ok, Sq | o0 o1 o2 … ok-1, Sp)
        = Σ(p=0 to N) F(k-1, p) P(ok, Sq | Sp)
        = Σ(p=0 to N) F(k-1, p) P(Sp → Sq) with arc label ok

       O0  O1  O2  O3  …  Ok  Ok+1  …  Om-1  Om
       S0  S1  S2  S3  …  Sp  Sq    …  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119
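The recurrence above translates directly into code; a sketch that follows the slide's arc-emission convention P(ok|Sp)·P(Sq|Sp), with toy arrays assumed:

import numpy as np

def forward(obs, A, B, start):
    # A[p, q] = P(Sq | Sp); B[p, o] = P(o | Sp); start = index of S0
    F = np.zeros(A.shape[0])
    F[start] = 1.0
    for o in obs:
        # F_new[q] = sum_p F[p] * P(o | Sp) * P(Sq | Sp)
        F = (F * B[:, o]) @ A
    return F.sum()          # P(observation sequence), summed over ending states

A = np.array([[0.1, 0.4, 0.5], [0.6, 0.2, 0.2], [0.3, 0.4, 0.3]])
B = np.array([[0.3, 0.5, 0.2], [0.1, 0.4, 0.5], [0.6, 0.1, 0.3]])
print(forward([0, 0, 1], A, B, start=0))   # P(R R G) when the walk starts in U1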

Backward probability B(k,i)

Define B(k,i) = Probability of seeing ok ok+1 ok+2 … om given that the state was Si
B(k,i) = P(ok ok+1 ok+2 … om | Si)

With m as the length of the whole observed sequence:
P(observed sequence) = P(o0 o1 o2 … om) = P(o0 o1 o2 … om | S0) = B(0, 0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd.)

B(k, p) = P(ok ok+1 ok+2 … om | Sp)
        = P(ok+1 ok+2 … om, ok | Sp)
        = Σ(q=0 to N) P(ok+1 ok+2 … om, ok, Sq | Sp)
        = Σ(q=0 to N) P(ok, Sq | Sp) P(ok+1 ok+2 … om | ok, Sq, Sp)
        = Σ(q=0 to N) P(ok+1 ok+2 … om | Sq) P(ok, Sq | Sp)
        = Σ(q=0 to N) B(k+1, q) P(Sp → Sq) with arc label ok

       O0  O1  O2  O3  …  Ok  Ok+1  …  Om-1  Om
       S0  S1  S2  S3  …  Sp  Sq    …  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability: to find P(O) [the probability of the Observation Sequence].

E.g., ^ People laugh

(Figure: trellis ^ → {N, V} → {N, V} → end.)

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probabilities (row = from, column = to; last column = end marker):
       ^    N    V    end
^      0   0.7  0.3   0
N      0   0.2  0.6   0.2
V      0   0.6  0.2   0.2
end    1    0    0    0

Lexical probabilities:
       ε    People   Laugh
^      1      0        0
N      0     0.8      0.2
V      0     0.1      0.9
end    1      0        0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all state sequences S of ∏_j P(Oj|Sj) P(Sj → Sj+1), i.e., enumerate every path of the tree and sum the path probabilities.

Computation in various paths of the Tree (observations ε, People, Laugh)

Path 1: ^ N N :  P(Path1) = (1.0×0.7) × (0.8×0.2) × (0.2×0.2)
Path 2: ^ N V :  P(Path2) = (1.0×0.7) × (0.8×0.6) × (0.9×0.2)
Path 3: ^ V N :  P(Path3) = (1.0×0.3) × (0.1×0.6) × (0.2×0.2)
Path 4: ^ V V :  P(Path4) = (1.0×0.3) × (0.1×0.2) × (0.9×0.2)

(Figure: the tree ^ → {N, V} → {N, V} over ε, People, Laugh.)

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis

F = accumulated F × output probability × transition probability

F(N,1) = 0.7 × 1.0
F(V,1) = 0.3 × 1.0
F(N,2) = F(N,1) × (0.2 × 0.8) + F(V,1) × (0.6 × 0.1)
F(V,2) = F(N,1) × (0.6 × 0.8) + F(V,1) × (0.2 × 0.1)
F(end) = F(N,2) × (0.2 × 0.2) + F(V,2) × (0.2 × 0.9)

(Trellis: ^ → {N, V} → {N, V} → end over ε, People, Laugh.)

21 July 2014Pushpak Bhattacharyya Intro

POS 125
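The same numbers reproduced in code (a sketch; '$' written for the end marker, tables as on the slide a few pages back):

trans = {("^","N"): 0.7, ("^","V"): 0.3, ("N","N"): 0.2, ("N","V"): 0.6, ("N","$"): 0.2,
         ("V","N"): 0.6, ("V","V"): 0.2, ("V","$"): 0.2}
lex = {("eps","^"): 1.0, ("People","N"): 0.8, ("People","V"): 0.1,
       ("Laugh","N"): 0.2, ("Laugh","V"): 0.9}

F_N1 = lex[("eps","^")] * trans[("^","N")]      # 0.7
F_V1 = lex[("eps","^")] * trans[("^","V")]      # 0.3
F_N2 = F_N1 * lex[("People","N")] * trans[("N","N")] + F_V1 * lex[("People","V")] * trans[("V","N")]
F_V2 = F_N1 * lex[("People","N")] * trans[("N","V")] + F_V1 * lex[("People","V")] * trans[("V","V")]
F_end = F_N2 * lex[("Laugh","N")] * trans[("N","$")] + F_V2 * lex[("Laugh","V")] * trans[("V","$")]
print(F_end)   # 0.06676, which equals the sum of the four path probabilities above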

Number of Multiplications

Tree: Each path has 5 multiplications. There are 4 paths in the tree. Therefore a total of 20 multiplications and 3 additions.

Trellis: F(N,1) → 1 multiplication; F(V,1) → 1 multiplication; F(N,2) = F(N,1) × (1 mult) + F(V,1) × (1 mult) = 4 multiplications + 1 addition. Similarly for F(V,2) and F(end): 4 multiplications and 1 addition each.

So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = #States and |O| = Observation length (including ^ and the end marker).

Stage 1 of Trellis: |S| multiplications.
Each subsequent stage: |S| nodes, each node needs computation over |S| arcs; each arc = 1 multiplication, accumulated F = 1 more multiplication; total 2|S|² multiplications per stage.
Same for each stage before reading the end marker; at the final stage → 2|S| multiplications.

Therefore total multiplications = |S| + 2|S|²(|O| - 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of F values multiplied by P(Sp → Sq).
3. Complexity = |S| + 2|S|²(|O| - 1) + 2|S|
             = 2|S|²|O| - 2|S|² + 3|S|
             = O(|S|² |O|)

i.e., linear in the length of the input and quadratic in the number of states.

POS 128

Exercise

1. Backward Probability
   a) Derive the Backward Algorithm
   b) Compute its complexity
2. Express P(O) in terms of both Forward and Backward probabilities

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

• Scrabble: auto-completion of words (human vs machine)

• Humour detection using wordnet (incongruity theory)

• Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131


ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W)= P(t o m ae t o |Word is ldquotomatordquo) = Product of arc probabilities

t

s4

o m o

ae

t

aa

end

s1 s2 s3s5

s6 s7

10 10 10 1010

10

073

027

Word

Pronunciation Automaton

Tomato

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs Discrimnative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

bull Entity A and Entity BndashMatch(AB)

ndashTwo entities match iff their parts matchbull Match(Parts(A) Parts(B))

ndashTwo entities match iff their properties matchbull Match(Properties(A) Properties(B))

bull Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding James Allan Speech and NLP Jurafsky and Martin Foundations of Statistical NLP Manning and Schutze

Other References Statistical NLP Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied DisciplinesPhilosophy Semantics Meaning of ldquomeaningrdquo Logic

(syllogism)Linguistics Study of Syntax Lexicon Lexical Semantics etc

Probability and Statistics Corpus Linguistics Testing of Hypotheses System Evaluation

Cognitive Science Computational Models of Language Processing Language Acquisition

Psychology Behavioristic insights into Language Processing Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory Entropy Random Fields

Computer Sc amp Engg Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

infection declension derivation etc ) Rule based Vs Statistical NLU comparison with POS tagging as case study Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

ldquo_ldquo The_DT mechanisms_NNS that_WDTmake_VBP traditional_JJ hardware_NNare_VBP really_RB being_VBGobsoleted_VBN by_IN microprocessor-based_JJ machines_NNS _ rdquo_rdquo said_VBDMr_NNP Benton_NNP _

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd.)
B(k, p) = P(ok, ok+1, ok+2, …, om | Sp)
        = Σ_{q=0..N} P(ok, ok+1, …, om, Sq | Sp)
        = Σ_{q=0..N} P(ok, Sq | Sp) · P(ok+1, ok+2, …, om | ok, Sq, Sp)
        = Σ_{q=0..N} P(ok+1, ok+2, …, om | Sq) · P(ok, Sq | Sp)
        = Σ_{q=0..N} B(k+1, q) · P(Sp → Sq) on ok

        O0 O1 O2 O3 … Ok Ok+1 … Om-1 Om
        S0 S1 S2 S3 …  Sp  Sq …  Sm  Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121
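A companion sketch of the backward pass under the same assumptions (it reuses A, B and pi from the forward sketch above); by construction, forward and backward return the same P(O):

def backward(obs):
    # beta(m, p) = 1; beta(k, p) = sum_q P(q | p) * P(o_{k+1} | q) * beta(k+1, q)
    beta = {p: 1.0 for p in A}
    for o in reversed(obs[1:]):
        beta = {p: sum(A[p][q] * B[q][o] * beta[q] for q in A) for p in A}
    # P(O) = sum_p pi(p) * P(o0 | p) * beta(0, p)
    return sum(pi[p] * B[p][obs[0]] * beta[p] for p in A)

print(backward(list("RRGGBRGR")))   # same value as forward(...)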

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probabilities (row = from, column = to):
      ^     N     V     .
^     0    0.7   0.3    0
N     0    0.2   0.6   0.2
V     0    0.6   0.2   0.2
.     1     0     0     0

Lexical (emission) probabilities:
      ε    People   Laugh
^     1      0        0
N     0     0.8      0.2
V     0     0.1      0.9
.     1      0        0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all state sequences of ∏_j [ P(Oj | Sj) · P(Sj+1 | Sj) ]   (computed path by path)

Computation in the various paths of the tree (observations ε, People, Laugh):
Path 1: ^ N N    P(Path1) = (1.0 × 0.7) × (0.8 × 0.2) × (0.2 × 0.2)
Path 2: ^ N V    P(Path2) = (1.0 × 0.7) × (0.8 × 0.6) × (0.9 × 0.2)
Path 3: ^ V N    P(Path3) = (1.0 × 0.3) × (0.1 × 0.6) × (0.2 × 0.2)
Path 4: ^ V V    P(Path4) = (1.0 × 0.3) × (0.1 × 0.2) × (0.9 × 0.2)
[Tree diagram: ^ branches on ε to N and V; each branches again on People to N and V; Laugh is read at the leaves.]

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis
F = accumulated F × transition probability × output probability
F_N,People = 0.7 × 1.0
F_V,People = 0.3 × 1.0
F_N,Laugh = F_N,People × (0.2 × 0.8) + F_V,People × (0.6 × 0.1)
F_V,Laugh = F_N,People × (0.6 × 0.8) + F_V,People × (0.2 × 0.1)
F_.       = F_N,Laugh × (0.2 × 0.2) + F_V,Laugh × (0.2 × 0.9)
[Trellis: ^ --ε--> {N, V} --People--> {N, V} --Laugh--> .]

21 July 2014Pushpak Bhattacharyya Intro

POS 125
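A small sketch that checks the numbers on these slides: it sums the four path probabilities and also runs the trellis recurrence (with the emission folded in at the current word rather than on the outgoing arc); both give the same total P(O):

trans = {"^": {"N": 0.7, "V": 0.3},
         "N": {"N": 0.2, "V": 0.6, ".": 0.2},
         "V": {"N": 0.6, "V": 0.2, ".": 0.2}}
emit = {"N": {"People": 0.8, "Laugh": 0.2},
        "V": {"People": 0.1, "Laugh": 0.9}}

# brute force over the four paths ^ -> {N,V} -> {N,V} -> .
paths = [("N", "N"), ("N", "V"), ("V", "N"), ("V", "V")]
p_tree = sum(trans["^"][t1] * emit[t1]["People"] *
             trans[t1][t2] * emit[t2]["Laugh"] *
             trans[t2]["."] for t1, t2 in paths)

# trellis: accumulate F word by word, then absorb the transition into '.'
F = {t: trans["^"][t] * emit[t]["People"] for t in ("N", "V")}
F = {t: sum(F[s] * trans[s][t] for s in F) * emit[t]["Laugh"] for t in ("N", "V")}
p_trellis = sum(F[t] * trans[t]["."] for t in F)

print(p_tree, p_trellis)   # both 0.06676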

Number of Multiplications
Tree: each path has 5 multiplications; there are 4 paths in the tree, and summing them needs 3 additions. Therefore a total of 20 multiplications and 3 additions.
Trellis: F_N,People needs 1 multiplication; F_V,People needs 1 multiplication; F_N,Laugh = F_N,People × (…) + F_V,People × (…) needs 4 multiplications + 1 addition. Similarly for F_V,Laugh and F_.: 4 multiplications and 1 addition each.
So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity
Let |S| = #states and |O| = observation length (not counting ^ and .).
Stage 1 of the trellis: |S| multiplications.
Stage 2 of the trellis: |S| nodes; each node needs computation over |S| arcs; each arc = 1 multiplication; accumulating F = 1 more multiplication. Total: 2|S|² multiplications.
Same for each stage before reading '.'. At the final stage ('.'): 2|S| multiplications.
Therefore total multiplications = |S| + 2|S|²(|O| - 1) + 2|S|.

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values multiplied by P(Si → Sfinal).
3. Complexity = |S| + 2|S|²(|O| - 1) + 2|S| = 2|S|²|O| - 2|S|² + 3|S| = O(|S|²·|O|),
i.e., linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble: auto-completion of words (human vs machine)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Phonetics processing of speech sound and associated challenges

Homophones bank (finance) vs bank (river bank) Near Homophones maatraa vs maatra (hin) Word Boundary

आजायग (aajaayenge) (aa jaayenge (will come) or aaj aayenge(will come today)

I got [ua]plate His research is in human languages

Disfluency ah um ahem etc

(near homophone trouble) The king of Abu Dhabi expired and there was national mourning for 7 days Some children were playing in the evening when a person chided them Do not play it is mourning time The children said No it is evening time and we will play

21 July 2014Pushpak Bhattacharyya Intro

POS 16

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 17

Morphology Word formation rules from root words Nouns Plural (boy-boys) Gender marking (czar-czarina) Verbs Tense (stretch-stretched) Aspect (eg perfective sit-had

sat), Modality (e.g. request: khaanaa, khaaiie). A first crucial step in NLP. Languages rich in morphology: e.g. Dravidian, Hungarian,
Turkish. Languages poor in morphology: Chinese, English. Languages with rich morphology have the advantage of easier

processing at higher stages of processing A task of interest to computer science Finite State Machines for

Word Morphology

21 July 2014Pushpak Bhattacharyya Intro

POS 18

Lexical Analysis

Dictionary and word properties

dog: noun (lexical property); takes 's' in plural (morph property); animate (semantic property); 4-legged (-do-); carnivore (-do-)

21 July 2014Pushpak Bhattacharyya Intro

POS 19

Lexical Disambiguationpart of Speech Disambiguation

Dog as a noun (animal) Dog as a verb (to pursue)

Sense Disambiguation Dog (as animal) Dog (as a very detestable person) The chair emphasised the need for adult education

Very common in day to day communicationsSatellite Channel Ad Watch what you want when you

want (two senses of watch)Ground breaking ceremonyresearch(ToI 14114) India eradicates polio says WHO

21 July 2014Pushpak Bhattacharyya Intro

POS 20

Technological developments bring in new terms additional meaningsnuances for existing terms

Justify as in justify the right margin (word processing context)

Xeroxed a new verb Digital Trace a new expression Communifaking pretending to talk on

mobile when you are actually not Discomgooglation anxietydiscomfort at

not being able to access internet Helicopter Parenting over parenting Obamagain Obama care modinomics

21 July 2014Pushpak Bhattacharyya Intro

POS 21

Ambiguity of Multiwords The grandfather kicked the bucket after suffering from cancer This job is a piece of cake Put the sweater on He is the dark horse of the match

Google Translations of above sentencesदादा कसर स पी ड़त होन क बाद बा ट लात मार इस काम क कक का एक टकड़ा हवटर पर रखोवह मच क अधर घोड़ा ह

21 July 2014Pushpak Bhattacharyya Intro

POS 22

Ambiguity of Named Entities

Bengali চ ল সরকার বািড়েত আেছEnglish Government is restless at home ()Chanchal Sarkar is at home

Amsterdam airport ldquoBaby Changing Roomrdquo

Hindi द नक दबग द नयाEnglish everyday bold worldActually name of a Hindi newspaper in Indore

High degree of overlap between NEs and MWEs

Treat differently - transliterate do not translate

21 July 2014Pushpak Bhattacharyya Intro

POS 23

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 24

StructureS

NPVP

V NP

Ilike mangoes

21 July 2014Pushpak Bhattacharyya Intro

POS25

Structural Ambiguity Scope

1The old men and women were taken to safe locations(old men and women) vs ((old men) and women)2 No smoking areas will allow Hookas inside

Preposition Phrase Attachment I saw the boy with a telescope

(who has the telescope) I saw the mountain with a telescope

(world knowledge mountain cannot be an instrument of seeing)

Very ubiquitous - newspaper headline: "20 years later, BMC pays father 20 lakhs for causing son's death"

21 July 2014Pushpak Bhattacharyya Intro

POS 26

Garden pathing

The only minus possibly was the need to face the audience more and more insightful question answer

The old man the boat

The horse raced past the garden fell

21 July 2014Pushpak Bhattacharyya Intro

POS 27

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 28

Semantic Analysis Representation in terms of

Predicate calculusSemantic NetsFramesConceptual Dependencies and Scripts

John gave a book to Mary Give action Agent John Object Book Recipient

Mary Challenge ambiguity in semantic role labeling

(Eng) Visiting aunts can be a nuisance (Hin) aapko mujhe mithaai khilaanii padegii

(ambiguous in Marathi and Bengali too not in Dravidian languages)

21 July 2014Pushpak Bhattacharyya Intro

POS 29

Coreference challenge

Binding of co-referring nouns and pronouns The monkey ate the banana because it

was hungry The monkey ate the banana because it

was ripe and sweet The monkey ate the banana because it

was lunch time

21 July 2014Pushpak Bhattacharyya Intro

POS 30

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 31

Pragmatics Very hard problem Model user intention

Tourist (in a hurry checking out of the hotel motioning to the service boy) Boy go upstairs and see if my sandals are under the divan Do not be late I just have 15 minutes to catch the train

Boy (running upstairs and coming back panting) yes sir they are there

World knowledge WHY INDIA NEEDS A SECOND OCTOBER (ToI

21007)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

DiscourseProcessing of sequence of sentences Mother to John

John go to school It is open today Should you bunk Father will be very angry

Ambiguity of openbunk whatWhy will the father be angry

Complex chain of reasoning and application of world knowledge Ambiguity of father

father as parent or

father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected - today was the math test.

He couldn't control the class.

Teacher shouldn't have made him responsible.

After all, he is just a janitor.

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (12)1 Teacher (angrily) did you miss the class yesterday

Student not much

2 A man coming back to his parked car sees the sticker Parking fine He goes and thanks the policeman for appreciating his parking skill

3 John I got a Jaguar car for my unemployed youngest sonJack Thats a great exchange

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (22)A teacher-student exchangeTeacher What do you think is the capital of Ethiopia

Student What do you think

Teacher (angrily) I do not think ltpausegt I know

Student I do not think I know ltno pausegt21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps Segmentation (Word Boundary Detection) Identify the word

Isolated Word Recognition: identify W given SS (the speech signal)

W* = argmax_W P(W | SS)

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood, called the "phonological model"; intuitively more tractable.
P(W) = prior probability, called the "language model".

W* = argmax_W P(W | SS) = argmax_W P(W) · P(SS | W)

P(W) = #(W appears in the corpus) / #(words in the corpus)

21 July 2014Pushpak Bhattacharyya Intro

POS 39
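A toy sketch of the decision rule W* = argmax_W P(W) · P(SS|W); the two candidate words and all numbers here are made up for illustration:

prior = {"weather": 3e-5, "whether": 7e-5}       # P(W), made-up language-model priors
likelihood = {"weather": 0.40, "whether": 0.35}  # P(SS|W), made-up phonological scores

best = max(prior, key=lambda w: prior[w] * likelihood[w])
print(best)   # 'whether': the prior outweighs the slightly lower likelihood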

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound Text ambiguity whether vs weather right vs write bought vs bot

Text Sound ambiguity read (present tense) vs read (past tense) lead (verb) vs lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W) = P(t o m ae t o | Word is "tomato") = product of arc probabilities

[Pronunciation automaton for the word "Tomato": states s1 … s7 and end; arcs t, o, m with probability 1.0 each, then a branch ae (0.73) / aa (0.27), then t, o with probability 1.0 each.]

21 July 2014Pushpak Bhattacharyya Intro

POS 44
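A minimal sketch of P(SS|W) as a product of arc probabilities; the state names and automaton structure are assumed from the figure, and only the 0.73/0.27 branch is non-trivial:

arc_prob = {("s4", "ae"): 0.73, ("s4", "aa"): 0.27}   # the only non-1.0 arcs

def p_pronunciation(phones, states=("s1", "s2", "s3", "s4", "s5", "s6", "s7")):
    p = 1.0
    for state, phone in zip(states, phones):
        p *= arc_prob.get((state, phone), 1.0)        # unlabeled arcs count as 1.0
    return p

print(p_pronunciation(["t", "o", "m", "ae", "t", "o"]))   # 0.73
print(p_pronunciation(["t", "o", "m", "aa", "t", "o"]))   # 0.27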

Foundational question

Generative vs Discriminative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

• Entity A and Entity B: Match(A, B)
  - Two entities match iff their parts match: Match(Parts(A), Parts(B))
  - Two entities match iff their properties match: Match(Properties(A), Properties(B))
• Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding - James Allen; Speech and Language Processing - Jurafsky and Martin; Foundations of Statistical NLP - Manning and Schutze

Other references: Statistical NLP - Charniak

Journals: Computational Linguistics; Natural Language Engineering; AI; AI Magazine; IEEE SMC
Conferences:

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied DisciplinesPhilosophy Semantics Meaning of ldquomeaningrdquo Logic

(syllogism)Linguistics Study of Syntax Lexicon Lexical Semantics etc

Probability and Statistics Corpus Linguistics Testing of Hypotheses System Evaluation

Cognitive Science Computational Models of Language Processing Language Acquisition

Psychology Behavioristic insights into Language Processing Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory Entropy Random Fields

Computer Sc amp Engg Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

inflection, declension, derivation, etc.); Rule-based vs Statistical NLU comparison with POS tagging as case study; Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summary
• Both Linguistics and Computation are needed
• Linguistics is the eye, Computation the body
• Phenomenon → Formalization → Technique → Experimentation → Evaluation → Hypothesis Testing
• has accorded to NLP the prestige it commands today
• Natural-Science-like approach
• Neither Theory Building nor Data-Driven Pattern Finding can be ignored
21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

"_" The_DT mechanisms_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, "_" said_VBD Mr._NNP Benton_NNP ._.

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate the complexity of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation: That_F/N/J ('that' can be a complementizer (can be put under 'F'), demonstrative (can be put under 'J'), or pronoun (can be put under 'N')); former_J; Sri_N/J Lanka_N/J (Sri Lanka together qualify the skipper); skipper_N/V ('skipper' can be a verb too); and_F; ace_J/N ('ace' can be both J and N: "Nadal served an ace"); batsman_N/J ('batsman' can be J as it qualifies Aravinda De Silva); Aravinda_N De_N Silva_N; is_F; a_F; man_N/V ('man' can be a verb too, as in 'man the boat'); of_F; few_J; words_N/V ('words' can be a verb too, as in 'he words his speeches beautifully')

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of "That":

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation (contd.): was_F very_R much_R evident_J on_F Wednesday_N; when_F/N ('when' can be a relative pronoun (put under 'N'), as in 'I know the time when he comes'); the_F legendary_J batsman_N; who_F/N; has_V always_R let_V his_N bat_N/V talk_V/N; struggle_V/N; answer_V/N; barrage_N/V; question_N/V; function_N/V; promote_V; cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (1/2)
Best tag sequence = T* = argmax P(T|W) = argmax P(T)·P(W|T)   (by Bayes' Theorem)

P(T) = P(t0=^, t1, t2, …, tn+1=.)
     = P(t0)·P(t1|t0)·P(t2|t1,t0)·P(t3|t2,t1,t0) … P(tn|tn-1,…,t0)·P(tn+1|tn,…,t0)
     = P(t0)·P(t1|t0)·P(t2|t1) … P(tn|tn-1)·P(tn+1|tn)
     = ∏_{i=0}^{N+1} P(ti|ti-1)    (Bigram Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0…tn+1)·P(w1|w0, t0…tn+1)·P(w2|w1,w0, t0…tn+1) … P(wn|w0…wn-1, t0…tn+1)·P(wn+1|w0…wn, t0…tn+1)

Assumption: a word is determined completely by its tag. This is inspired by speech recognition.

       = P(w0|t0)·P(w1|t1) … P(wn+1|tn+1)
       = ∏_{i=0}^{n+1} P(wi|ti)
       = ∏_{i=1}^{n+1} P(wi|ti)    (Lexical Probability Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R ._.

[Diagram: the tags as states joined by bigram probabilities, with each word emitted from its tag via lexical probabilities.]

This model is called the Generative model. Here words are observed from tags as states. This is similar to an HMM.

21 July 2014Pushpak Bhattacharyya Intro

POS 67
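A short sketch of the generative score P(T)·P(W|T) for the tagged sentence above; all bigram and lexical numbers here are made up for illustration:

bigram = {("^", "N"): 0.6, ("N", "V"): 0.4, ("V", "R"): 0.2, ("R", "."): 0.3}
lexical = {("People", "N"): 1e-3, ("Jump", "V"): 2e-3, ("High", "R"): 5e-3, (".", "."): 1.0}

tags  = ["^", "N", "V", "R", "."]
words = [None, "People", "Jump", "High", "."]

score = 1.0
for i in range(1, len(tags)):
    score *= bigram[(tags[i - 1], tags[i])]            # P(t_i | t_{i-1})
    score *= lexical[(words[i], tags[i])]              # P(w_i | t_i)
print(score)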

Typical POS tag steps

Implementation of Viterbi - Unigram,

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

Per POS Accuracy for the Bigram Assumption
[Bar chart: accuracy (0 to 1) for each BNC POS tag, AJ0 through VVZ-NN2.]

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

          AJ0  AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN  AJC  AJS   AT0   AV0  AV0-AJ0  AVP
AJ0      2899       20       32        1        3        3    0    0    18    35       27    1
AJ0-AV0    31       18        2        0        0        0    0    0     0     1       15    0
AJ0-NN1   161        0      116        0        0        0    0    0     0     0        1    0
AJ0-VVD     7        0        0        0        0        0    0    0     0     0        0    0
AJ0-VVG     8        0        0        0        2        0    0    0     1     0        0    0
AJ0-VVN     8        0        0        3        0        2    0    0     1     0        0    0
AJC         2        0        0        0        0        0   69    0     0    11        0    0
AJS         6        0        0        0        0        0    0   38     0     2        0    0
AT0       192        0        0        0        0        0    0    0  7000    13        0    0
AV0       120        8        2        0        0        0   15    2    24  2444       29   11
AV0-AJ0    10        7        0        0        0        0    0    0     0    16       33    0
AVP        24        0        0        0        0        0    0    0     1    11        0  737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1: Red = 30%, Green = 50%, Blue = 20%
Urn 2: Red = 10%, Green = 40%, Blue = 50%
Urn 3: Red = 60%, Green = 10%, Blue = 30%

Colored ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

Given the transition probability table
      U1   U2   U3
U1   0.1  0.4  0.5
U2   0.6  0.2  0.2
U3   0.3  0.4  0.3
and the emission probability table
      R    G    B
U1   0.3  0.5  0.2
U2   0.1  0.4  0.5
U3   0.6  0.1  0.3
Observation: R R G G B R G R
State sequence: ? (not so easily computable)

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Transition probability table and emission probability table
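A small sketch (uniform start probability assumed) of why one candidate state sequence is easy to score even though the best sequence is not easily computable: P(S, O) is just a product of transition and emission entries from the tables above.

A = {"U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},
     "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
     "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3}}
B = {"U1": {"R": 0.3, "G": 0.5, "B": 0.2},
     "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
     "U3": {"R": 0.6, "G": 0.1, "B": 0.3}}

def joint(states, obs, start=1/3):
    # P(S, O) = pi(S1) * P(O1|S1) * prod_k P(S_k|S_{k-1}) * P(O_k|S_k)
    p = start * B[states[0]][obs[0]]
    for k in range(1, len(states)):
        p *= A[states[k - 1]][states[k]] * B[states[k]][obs[k]]
    return p

print(joint(["U3", "U1", "U2", "U3", "U2", "U3", "U1", "U3"], list("RRGGBRGR")))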

Diagrammatic representation (12)

[State-transition diagram: U1, U2, U3 with the transition probabilities above on the arcs; each state is annotated with its emission probabilities (U1: R 0.3, G 0.5, B 0.2; U2: R 0.1, G 0.4, B 0.5; U3: R 0.6, G 0.1, B 0.3).]

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

[The same diagram with each arc labelled by the combined transition × emission probabilities, e.g. one arc carries R: 0.02, G: 0.08, B: 0.10.]

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM:
1. Given the observation sequence, find the possible state sequences - Viterbi algorithm
2. Given the observation sequence, find its probability - forward/backward algorithm
3. Given the observation sequence, find the HMM parameters - Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi: The "start" and "end" are important in a sequence. Subtrees get eliminated due to the Markov Assumption.

POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), . (end) [start & end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

[Partial sequence graph: ^ expands to N and V for "people", each expanding again for "laugh".]

Transition probability table (row = previous tag, column = next tag):
      ^     N     V     O     .
^     0    0.6   0.2   0.2    0
N     0    0.1   0.4   0.3   0.2
V     0    0.3   0.1   0.3   0.3
O     0    0.3   0.2   0.3   0.2
.     1     0     0     0     0
This transition table will change from language to language due to language divergences.

Lexical probability table (size = #POS tags in the tagset × vocabulary size, where vocabulary size = #unique words in the corpus):
      ε     people     laugh      …
^     1       0          0        0
N     0     1×10-3     1×10-5
V     0     1×10-6     1×10-3
O     0       0        1×10-9
.     1       0          0        0

Inference - new sentence: ^ people laugh .
p(^ N N . | ^ people laugh .) = (0.6 × 0.1) × (0.1 × 1×10-3) × (0.2 × 1×10-5)
[Partial sequence graph: ^ → N → N with an alternative N → V branch; ε marks the start and end transitions.]

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = #states (tags + ^ + .) and |o| = length of the sentence (#words + ^ + .), then #sequences = |s|^(|o|-2).

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming
[Tree: ^ expands on ε to N1, V2, O3; N1 then expands on "people" to N4, N5 and further V, O nodes, which would expand again on "laugh".]
^ → N1: 0.6 × 1.0 = 0.6 (likewise 0.2 to V2 and 0.2 to O3)
From N1, with P(people|N) = 1×10-3:
  1. to N: 0.6 × 0.1 × 10-3 = 6 × 10-5
  2. to V: 0.6 × 0.4 × 10-3 = 2.4 × 10-4
  3. to O: 0.6 × 0.3 × 10-3 = 1.8 × 10-4
     to .: 0.6 × 0.2 × 10-3 = 1.2 × 10-4
No need to expand N4 and N5, because they will never be a part of the winning sequence.

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.

Now the complexity reduces from |s|^|o| to about |s|²·|o|.

Here we followed the Markov assumption of order 1
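A sketch that does the exponential enumeration explicitly for "^ people laugh ." with the transition and lexical tables above, to show what the dynamic-programming (Viterbi) computation avoids:

from itertools import product

trans = {"^": {"N": 0.6, "V": 0.2, "O": 0.2},
         "N": {"N": 0.1, "V": 0.4, "O": 0.3, ".": 0.2},
         "V": {"N": 0.3, "V": 0.1, "O": 0.3, ".": 0.3},
         "O": {"N": 0.3, "V": 0.2, "O": 0.3, ".": 0.2}}
lex = {"N": {"people": 1e-3, "laugh": 1e-5},
       "V": {"people": 1e-6, "laugh": 1e-3},
       "O": {"people": 0.0,  "laugh": 1e-9}}
words = ["people", "laugh"]

def score(tags):
    p, prev = 1.0, "^"
    for tag, word in zip(tags, words):
        p *= trans[prev][tag] * lex[tag][word]   # transition, then emission
        prev = tag
    return p * trans[prev]["."]                  # close the sentence with '.'

best = max(product("NVO", repeat=len(words)), key=score)
print(best, score(best))   # ('N', 'V') wins with these tables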

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

"maximum" amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm
Tree for the sentence "^ People laugh ."
^ expands on ε to N (0.6), V (0.2), O (0.2).
The N node (emitting "people", 1×10-3) expands to N (0.06×10-3), V (0.24×10-3), O (0.18×10-3).
The V node (emitting "people", 1×10-6) expands to N (0.06×10-6), V (0.02×10-6), O (0.06×10-6).
The O node (emitting "people", 0) expands to N (0), V (0), O (0).
Claim: We do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass: Will a word always be given the same tag? No. Consider the examples:
^ people the city with soldiers . (i.e. 'populate')
^ quickly people the city .
In the first sentence "people" is most likely to be tagged as a noun, whereas in the second the probability mass will shift and "people" will be tagged as a verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and language phenomenon
Long-tail phenomenon: probability is very low but not zero over a large observed sequence.
Language phenomenon: "people", which is predominantly tagged as "Noun", displays a long-tail phenomenon; "laugh" is predominantly tagged as "Verb".

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

[Two sibling nodes N1 (6×10-5) and N2 (6×10-8), each about to expand to N, V, O on "laugh".]
Next step: all the probabilities will be multiplied by identical probabilities (lexical and transition). So the children of N2 will have probability less than the children of N1.

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean?
P(A|B) = P(B|A) if P(A) = P(B).
P(A|B) can mean: Causality (B causes A), or Sequentiality (A follows B).

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here: S = {U1, U2, U3}, V = {R, G, B}.
For observation O = o1 … on and state sequence Q = q1 … qn:
π_i = P(q1 = Ui)    (initial state probabilities)
A = transition probability table:
      U1   U2   U3
U1   0.1  0.4  0.5
U2   0.6  0.2  0.2
U3   0.3  0.4  0.3
B = emission probability table:
      R    G    B
U1   0.3  0.5  0.2
U2   0.1  0.4  0.5
U3   0.6  0.1  0.3

92

Observations and states
        O1 O2 O3 O4 O5 O6 O7 O8
Obs:     R  R  G  G  B  R  G  R
State:  S1 S2 S3 S4 S5 S6 S7 S8
Si ∈ {U1, U2, U3} - a particular state; S: state sequence; O: observation sequence; S*: the "best" possible state (urn) sequence.
Goal: maximize P(S*|O) by choosing the "best" S.

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

S* = argmax_S P(S | O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

P(S|O) = P(S1…S8 | O1…O8)
       = P(S1|O) · P(S2|S1, O) · P(S3|S2, S1, O) · … · P(S8|S7…S1, O)
By the Markov Assumption (a state depends only on the previous state):
P(S|O) = P(S1|O) · P(S2|S1, O) · P(S3|S2, O) · … · P(S8|S7, O)

        O1 O2 O3 O4 O5 O6 O7 O8
Obs:     R  R  G  G  B  R  G  R
State:  S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) = P(A) · P(B|A) / P(B)

P(A): prior;  P(B|A): likelihood

argmax_S P(S|O) = argmax_S P(S) · P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

P(S) = P(S1, …, S8)
     = P(S1)·P(S2|S1)·P(S3|S2,S1)·P(S4|S3,S2,S1)·…·P(S8|S7,…,S1)

By the Markov Assumption (k=1):

P(S) = P(S1)·P(S2|S1)·P(S3|S2)·P(S4|S3)·…·P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

P(O|S) = P(O1|S1…S8)·P(O2|O1, S1…S8)·P(O3|O2,O1, S1…S8)·…·P(O8|O1…O7, S1…S8)

Assumption: the ball drawn depends only on the urn chosen, so

P(O|S) = P(O1|S1)·P(O2|S2)·P(O3|S3)·…·P(O8|S8)

Hence
P(S|O) ∝ P(S)·P(O|S)
       = P(S1)·P(S2|S1)·P(S3|S2)·P(S4|S3)·…·P(S8|S7) · P(O1|S1)·P(O2|S2)·P(O3|S3)·…·P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)·P(O|S) = [P(O0|S0)·P(S1|S0)] · [P(O1|S1)·P(S2|S1)] · [P(O2|S2)·P(S3|S2)] · [P(O3|S3)·P(S4|S3)] · [P(O4|S4)·P(S5|S4)] · [P(O5|S5)·P(S6|S5)] · [P(O6|S6)·P(S7|S6)] · [P(O7|S7)·P(S8|S7)] · [P(O8|S8)·P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e., P(S9|S8) = 1.
O0 is the ε-transition.

        O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:     ε  R  R  G  G  B  R  G  R
State:  S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

[Lattice: states S0 S1 S2 … S8 S9 in a chain, with the arcs labelled by the observations ε R R G G B R G R.]

        O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:     ε  R  R  G  G  B  R  G  R
State:  S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

Notation: P(Ok|Sk)·P(Sk+1|Sk) = P(Sk → Sk+1) on Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

[Two-state probabilistic FSM over S1 and S2; the arcs are labelled with (symbol, probability) pairs: (a1, 0.3), (a2, 0.4), (a1, 0.2), (a2, 0.3), (a1, 0.1), (a2, 0.2), (a1, 0.3), (a2, 0.2).]

The question here is: "What is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree
Start: S1 carries 1.0, S2 carries 0.0.
Reading a1 (transition probabilities 0.1, 0.3 from S1 and 0.2, 0.3 from S2):
  new values S1: 1.0 × 0.1 = 0.1, S2: 1.0 × 0.3 = 0.3 (the branches from S2 contribute 0.0).
Reading a2 (transition probabilities 0.2, 0.4 from S1 and 0.3, 0.2 from S2):
  candidates 0.1 × 0.2 = 0.02, 0.1 × 0.4 = 0.04, 0.3 × 0.3 = 0.09, 0.3 × 0.2 = 0.06.
Choose the winning sequence per state, per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd.
The surviving nodes after a2 carry 0.09 and 0.06.
Reading the next a1 (transition probabilities 0.1, 0.3 and 0.2, 0.3): candidates 0.09 × 0.1 = 0.009, 0.027, 0.012, 0.018; again the winner per state is kept.
Reading the final a2: S1 via 0.3 → 0.0081, S2 via 0.2 → 0.0054, S2 via 0.4 → 0.0048, S1 via 0.2 → 0.0024.

The problem being addressed by this tree is S* = argmax_S P(S | a1 a2 a1 a2, μ),
where a1-a2-a1-a2 is the output sequence and μ the model (the machine).

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 → S2 → S1 → S2 → S1, on the output sequence a1 a2 a1 a2 (recovered by working backward).

Problem statement: Find the best possible sequence S* = argmax_S P(S | O, μ),
where S = state sequence, O = output sequence, and μ = the machine (or model).
The machine (model) μ = (A, S, T, S0): A = alphabet set, S = state collection, T = transitions, S0 = start symbol (start state).
T is defined as the set of probabilities P(Si --ak--> Sj).

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

         ε      a1                                 a2              a1                 a2
S1      1.0    (1.0×0.1, 0.0×0.2) = (0.1, 0.0)    (0.02, 0.09)    (0.009, 0.012)     (0.0024, 0.0081)
S2      0.0    (1.0×0.3, 0.0×0.3) = (0.3, 0.0)    (0.04, 0.06)    (0.027, 0.018)     (0.0048, 0.0054)

Rows: ending state; columns: latest symbol observed.
Note: every cell records the winning probability ending in that state.
The bold-faced value in each cell shows the sequence probability ending in that state. Going backward from the final winner sequence, which ends in state S2 (indicated by the 2nd tuple), we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm (following James Allen, Natural Language Understanding (2nd edition), Benjamin/Cummings, 1995)
Given:
1. The HMM, which means:
   a. Start state S1
   b. Alphabet A = {a1, a2, …, ap}
   c. Set of states S = {S1, S2, …, Sn}
   d. Transition probabilities P(Si --ak--> Sj), which are equal to P(Sj, ak | Si)
2. The output string a1 a2 … aT
To find: the most likely sequence of states C1 C2 … CT which produces the given output sequence, i.e.,
C1 C2 … CT = argmax_C [ P(C | a1 a2 … aT) ]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contd.

Data structures:
1. An N×T array called SEQSCORE to maintain the winner sequence always (N = #states, T = length of the output sequence)
2. Another N×T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization  2. Iteration  3. Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization
SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For i = 2 to N do: SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration
For t = 2 to T do
  For i = 1 to N do
    SEQSCORE(i,t) = Max_{j=1..N} [ SEQSCORE(j, t-1) × P(Sj --a_k--> Si) ]
    BACKPTR(i,t) = the index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification
C(T) = the i that maximizes SEQSCORE(i,T)
For i from (T-1) down to 1 do: C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1×T
2. SEQSCORE can be T×2
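A minimal Python sketch of the SEQSCORE/BACKPTR procedure above, run on the three-urn HMM from the lecture; the start distribution is assumed uniform here and emissions are read from the state reached:

A = {"U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},
     "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
     "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3}}
B = {"U1": {"R": 0.3, "G": 0.5, "B": 0.2},
     "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
     "U3": {"R": 0.6, "G": 0.1, "B": 0.3}}
states = list(A)

def viterbi(obs, start=1/3):
    # Initialization: SEQSCORE(., 1)
    seqscore = [{s: start * B[s][obs[0]] for s in states}]
    backptr = [{}]
    # Iteration: SEQSCORE(i, t) = max_j SEQSCORE(j, t-1) * P(Sj -> Si) * P(o_t | Si)
    for o in obs[1:]:
        prev, col, ptr = seqscore[-1], {}, {}
        for s in states:
            j = max(states, key=lambda p: prev[p] * A[p][s])
            col[s], ptr[s] = prev[j] * A[j][s] * B[s][o], j
        seqscore.append(col)
        backptr.append(ptr)
    # Sequence identification: follow BACKPTR from the best final state
    best = max(states, key=lambda s: seqscore[-1][s])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    return list(reversed(path)), seqscore[-1][best]

print(viterbi(list("RRGGBRGR")))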

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 17

Morphology Word formation rules from root words Nouns Plural (boy-boys) Gender marking (czar-czarina) Verbs Tense (stretch-stretched) Aspect (eg perfective sit-had

sat) Modality (eg request khaanaa khaaiie) First crucial first step in NLP Languages rich in morphology eg Dravidian Hungarian

Turkish Languages poor in morphology Chinese English Languages with rich morphology have the advantage of easier

processing at higher stages of processing A task of interest to computer science Finite State Machines for

Word Morphology

21 July 2014Pushpak Bhattacharyya Intro

POS 18

Lexical Analysis

Dictionary and word properties

dognoun (lexical property)take-rsquosrsquo-in-plural (morph property)animate (semantic property)4-legged (-do-)carnivore (-do)

21 July 2014Pushpak Bhattacharyya Intro

POS 19

Lexical Disambiguationpart of Speech Disambiguation

Dog as a noun (animal) Dog as a verb (to pursue)

Sense Disambiguation Dog (as animal) Dog (as a very detestable person) The chair emphasised the need for adult education

Very common in day to day communicationsSatellite Channel Ad Watch what you want when you

want (two senses of watch)Ground breaking ceremonyresearch(ToI 14114) India eradicates polio says WHO

21 July 2014Pushpak Bhattacharyya Intro

POS 20

Technological developments bring in new terms additional meaningsnuances for existing terms

Justify as in justify the right margin (word processing context)

Xeroxed a new verb Digital Trace a new expression Communifaking pretending to talk on

mobile when you are actually not Discomgooglation anxietydiscomfort at

not being able to access internet Helicopter Parenting over parenting Obamagain Obama care modinomics

21 July 2014Pushpak Bhattacharyya Intro

POS 21

Ambiguity of Multiwords The grandfather kicked the bucket after suffering from cancer This job is a piece of cake Put the sweater on He is the dark horse of the match

Google Translations of above sentencesदादा कसर स पी ड़त होन क बाद बा ट लात मार इस काम क कक का एक टकड़ा हवटर पर रखोवह मच क अधर घोड़ा ह

21 July 2014Pushpak Bhattacharyya Intro

POS 22

Ambiguity of Named Entities

Bengali চ ল সরকার বািড়েত আেছEnglish Government is restless at home ()Chanchal Sarkar is at home

Amsterdam airport ldquoBaby Changing Roomrdquo

Hindi द नक दबग द नयाEnglish everyday bold worldActually name of a Hindi newspaper in Indore

High degree of overlap between NEs and MWEs

Treat differently - transliterate do not translate

21 July 2014Pushpak Bhattacharyya Intro

POS 23

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 24

StructureS

NPVP

V NP

Ilike mangoes

21 July 2014Pushpak Bhattacharyya Intro

POS25

Structural Ambiguity Scope

1The old men and women were taken to safe locations(old men and women) vs ((old men) and women)2 No smoking areas will allow Hookas inside

Preposition Phrase Attachment I saw the boy with a telescope

(who has the telescope) I saw the mountain with a telescope

(world knowledge mountain cannot be an instrument of seeing)

Very ubiquitous newspaper headline ldquo20 years later BMC pays father 20 lakhs for causing sonrsquos deathrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 26

Garden pathing

The only minus possibly was the need to face the audience more and more insightful question answer

The old man the boat

The horse raced past the garden fell

21 July 2014Pushpak Bhattacharyya Intro

POS 27

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 28

Semantic Analysis Representation in terms of

Predicate calculusSemantic NetsFramesConceptual Dependencies and Scripts

John gave a book to Mary Give action Agent John Object Book Recipient

Mary Challenge ambiguity in semantic role labeling

(Eng) Visiting aunts can be a nuisance (Hin) aapko mujhe mithaai khilaanii padegii

(ambiguous in Marathi and Bengali too not in Dravidian languages)

21 July 2014Pushpak Bhattacharyya Intro

POS 29

Coreference challenge

Binding of co-referring nouns and pronouns The monkey ate the banana because it

was hungry The monkey ate the banana because it

was ripe and sweet The monkey ate the banana because it

was lunch time

21 July 2014Pushpak Bhattacharyya Intro

POS 30

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 31

Pragmatics Very hard problem Model user intention

Tourist (in a hurry checking out of the hotel motioning to the service boy) Boy go upstairs and see if my sandals are under the divan Do not be late I just have 15 minutes to catch the train

Boy (running upstairs and coming back panting) yes sir they are there

World knowledge WHY INDIA NEEDS A SECOND OCTOBER (ToI

21007)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

DiscourseProcessing of sequence of sentences Mother to John

John go to school It is open today Should you bunk Father will be very angry

Ambiguity of openbunk whatWhy will the father be angry

Complex chain of reasoning and application of world knowledge Ambiguity of father

father as parent or

father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected ndash today was the math test

He couldnrsquot control the class

Teacher shouldnrsquot have made him responsible

After all he is just a janitor

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (12)1 Teacher (angrily) did you miss the class yesterday

Student not much

2 A man coming back to his parked car sees the sticker Parking fine He goes and thanks the policeman for appreciating his parking skill

3 John I got a Jaguar car for my unemployed youngest sonJack Thats a great exchange

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (22)A teacher-student exchangeTeacher What do you think is the capital of Ethiopia

Student What do you think

Teacher (angrily) I do not think ltpausegt I know

Student I do not think I know ltno pausegt21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps Segmentation (Word Boundary Detection) Identify the word

Isolated Word Recognition Identify W given SS (speech signal)

^arg max ( | )

WW P W SS

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood called ldquophonological model ldquo intuitively more tractable

P(W) = prior probability called ldquolanguage modelrdquo

^arg m ax ( | )

arg m ax ( ) ( | )W

W

W P W SS

P W P SS W

W appears in the corpus( ) words in the corpus

P W

21 July 2014Pushpak Bhattacharyya Intro

POS 39

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound Text ambiguity whether vs weather right vs write bought vs bot

Text Sound ambiguity read (present tense) vs read (past tense) lead (verb) vs lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W)= P(t o m ae t o |Word is ldquotomatordquo) = Product of arc probabilities

t

s4

o m o

ae

t

aa

end

s1 s2 s3s5

s6 s7

10 10 10 1010

10

073

027

Word

Pronunciation Automaton

Tomato

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs Discrimnative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

bull Entity A and Entity BndashMatch(AB)

ndashTwo entities match iff their parts matchbull Match(Parts(A) Parts(B))

ndashTwo entities match iff their properties matchbull Match(Properties(A) Properties(B))

bull Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding James Allan Speech and NLP Jurafsky and Martin Foundations of Statistical NLP Manning and Schutze

Other References Statistical NLP Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied DisciplinesPhilosophy Semantics Meaning of ldquomeaningrdquo Logic

(syllogism)Linguistics Study of Syntax Lexicon Lexical Semantics etc

Probability and Statistics Corpus Linguistics Testing of Hypotheses System Evaluation

Cognitive Science Computational Models of Language Processing Language Acquisition

Psychology Behavioristic insights into Language Processing Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory Entropy Random Fields

Computer Sc amp Engg Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

infection declension derivation etc ) Rule based Vs Statistical NLU comparison with POS tagging as case study Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

ldquo_ldquo The_DT mechanisms_NNS that_WDTmake_VBP traditional_JJ hardware_NNare_VBP really_RB being_VBGobsoleted_VBN by_IN microprocessor-based_JJ machines_NNS _ rdquo_rdquo said_VBDMr_NNP Benton_NNP _

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1: # of Red = 30, # of Green = 50, # of Blue = 20
Urn 2: # of Red = 10, # of Green = 40, # of Blue = 50
Urn 3: # of Red = 60, # of Green = 10, # of Blue = 30

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

Transition probability table:
       U1   U2   U3
U1    0.1  0.4  0.5
U2    0.6  0.2  0.2
U3    0.3  0.4  0.3

Emission probability table:
       R    G    B
U1    0.3  0.5  0.2
U2    0.1  0.4  0.5
U3    0.6  0.1  0.3

Given the two tables above and the observation: R R G G B R G R

the state sequence is not so easily computable.

21 July 2014Pushpak Bhattacharyya Intro

POS 73
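For later reference (Viterbi, forward/backward), the two tables can be transcribed directly into Python; the initial distribution π is an assumption here (uniform), since the slides introduce π only later.

# Urn HMM of the slides: transition table A, emission table B, assumed uniform pi.
states = ['U1', 'U2', 'U3']
A = {'U1': {'U1': 0.1, 'U2': 0.4, 'U3': 0.5},     # A[prev][next] = P(next urn | prev urn)
     'U2': {'U1': 0.6, 'U2': 0.2, 'U3': 0.2},
     'U3': {'U1': 0.3, 'U2': 0.4, 'U3': 0.3}}
B = {'U1': {'R': 0.3, 'G': 0.5, 'B': 0.2},        # B[urn][colour] = P(colour | urn)
     'U2': {'R': 0.1, 'G': 0.4, 'B': 0.5},
     'U3': {'R': 0.6, 'G': 0.1, 'B': 0.3}}
pi = {'U1': 1/3, 'U2': 1/3, 'U3': 1/3}            # assumed, not given on this slide
observations = list('RRGGBRGR')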

Diagrammatic representation (1/2)

[State diagram: urns U1, U2, U3 as states; arcs carry the transition probabilities (U1→U1 0.1, U1→U2 0.4, U1→U3 0.5; U2→U1 0.6, U2→U2 0.2, U2→U3 0.2; U3→U1 0.3, U3→U2 0.4, U3→U3 0.3), and each state is labelled with its emission probabilities (U1: R 0.3, G 0.5, B 0.2; U2: R 0.1, G 0.4, B 0.5; U3: R 0.6, G 0.1, B 0.3)]

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (2/2)

[Same state diagram, with each arc now labelled by the joint probability of taking the transition and emitting a colour, i.e. P(next urn | current urn) × P(colour | current urn); e.g. the U1→U1 arc carries R 0.03, G 0.05, B 0.02]

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM
1. Given the observation sequence, find the possible state sequences: Viterbi
2. Given the observation sequence, find its probability: forward/backward algorithm
3. Given the observation sequence, find the HMM parameters: Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi
 The "start" and "end" are important in a sequence.
 Subtrees get eliminated due to the Markov Assumption.

 POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), . (end) [start & end states]

Illustration of Viterbi: Lexicon
 people: N, V
 laugh:  N, V

Corpora for Training
 ^ w11_t11 w12_t12 w13_t13 …… w1k_1_t1k_1 .
 ^ w21_t21 w22_t22 w23_t23 …… w2k_2_t2k_2 .
 ^ wn1_tn1 wn2_tn2 wn3_tn3 …… wnk_n_tnk_n .
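The transition and lexical probability tables used below are obtained from such a tagged training corpus by simple counting (maximum-likelihood estimation); a sketch, with a two-sentence toy corpus assumed purely for illustration:

from collections import Counter, defaultdict

def estimate_tables(tagged_sentences):
    """MLE estimates: P(tag | previous tag) and P(word | tag) from a tagged corpus."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sentences:
        prev = '^'
        for word, tag in sent + [('.', '.')]:     # append the end marker
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    P_trans = {p: {t: c / sum(row.values()) for t, c in row.items()} for p, row in trans.items()}
    P_lex = {t: {w: c / sum(row.values()) for w, c in row.items()} for t, row in emit.items()}
    return P_trans, P_lex

# Toy corpus (assumed), in the "^ w_t ... ." format of the slide:
corpus = [
    [('people', 'N'), ('laugh', 'V')],
    [('people', 'V'), ('the', 'O'), ('city', 'N')],
]
P_trans, P_lex = estimate_tables(corpus)
print(P_trans['^'])   # {'N': 0.5, 'V': 0.5}
print(P_lex['N'])     # {'people': 0.5, 'city': 0.5}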

Inference

^

NN

NV

Transition probability table (row: previous tag, column: next tag; the last row/column is the end marker '.'):
       ^    N    V    O    .
^     0    0.6  0.2  0.2  0
N     0    0.1  0.4  0.3  0.2
V     0    0.3  0.1  0.3  0.3
O     0    0.3  0.2  0.3  0.2
.     1    0    0    0    0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

        ε     people    laugh     …
^      1     0         0         0
N      0     1×10⁻³    1×10⁻⁵    …
V      0     1×10⁻⁶    1×10⁻³    …
O      0     0         1×10⁻⁹    …
.      1     0         0         0

Inference: New Sentence
 ^ people laugh .

 P(^ N N . | ^ people laugh .) = [P(N|^)·P(people|N)] × [P(N|N)·P(laugh|N)] × [P(.|N)·1]
                               = (0.6 × 1×10⁻³) × (0.1 × 1×10⁻⁵) × (0.2 × 1)
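The same product can be read off the two tables programmatically; a minimal sketch (tables transcribed from the slides above; the end marker '.' is assumed to be emitted with probability 1):

# P(^ N N . | ^ people laugh .) from the transition and lexical tables above.
trans = {('^', 'N'): 0.6, ('N', 'N'): 0.1, ('N', '.'): 0.2}
lex = {('people', 'N'): 1e-3, ('laugh', 'N'): 1e-5, ('.', '.'): 1.0}   # P(. | .) = 1 assumed

p, prev = 1.0, '^'
for word, tag in [('people', 'N'), ('laugh', 'N'), ('.', '.')]:
    p *= trans[(prev, tag)] * lex[(word, tag)]
    prev = tag
print(p)    # (0.6 * 1e-3) * (0.1 * 1e-5) * (0.2 * 1.0) = 1.2e-10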

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = #states (tags + ^ + .) and |o| = length of the sentence (#words + ^ + .), then #sequences = |s|^(|o|-2)

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming

 ε:      from ^: N1 gets 0.6 × 1.0 = 0.6;  V1 gets 0.2;  O1 gets 0.2

 people: expanding N1 (accumulated 0.6, lexical P(people|N) = 10⁻³):
         to N: 0.6 × 0.1 × 10⁻³ = 6 × 10⁻⁵
         to V: 0.6 × 0.4 × 10⁻³ = 2.4 × 10⁻⁴
         to O: 0.6 × 0.3 × 10⁻³ = 1.8 × 10⁻⁴
         to .: 0.6 × 0.2 × 10⁻³ = 1.2 × 10⁻⁴

 laugh:  (similarly for the nodes reached at this stage)

No need to expand N4 and N5 because they will never be a part of the winning sequence.

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.

Now complexity reduces from |s|^|o| to |s|·|o|.

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state.
Keep advancing sequences that are "maximum" amongst all those ending in the same state.

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm

Tree for the sentence "^ People laugh ."

 ε:      from ^:  N (0.6), V (0.2), O (0.2)
 People: from N:  N (0.06×10⁻³), V (0.24×10⁻³), O (0.18×10⁻³)
         from V:  N (0.06×10⁻⁶), V (0.02×10⁻⁶), O (0.06×10⁻⁶)
         from O:  N (0), V (0), O (0)

Claim: We do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass: Will a word always be given the same tag? No. Consider the example:

 ^ people the city with soldiers . (i.e. 'populate')
 ^ quickly people the city .

In the first sentence "people" is most likely to be tagged as noun, whereas in the second the probability mass will shift and "people" will be tagged as verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon: "people", which is predominantly tagged as "Noun", displays a long tail phenomenon; "laugh" is predominantly tagged as "Verb".

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(6×10⁻⁵) (6×10⁻⁸)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B) = P(B|A) if P(A) = P(B)

P(A|B) can mean: Causality (B causes A), or Sequentiality (A follows B).

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here: S = {U1, U2, U3},  V = {R, G, B}

For observation O = o1 … on and state sequence Q = q1 … qn:

A (transition) =
       U1   U2   U3
U1    0.1  0.4  0.5
U2    0.6  0.2  0.2
U3    0.3  0.4  0.3

B (emission) =
       R    G    B
U1    0.3  0.5  0.2
U2    0.1  0.4  0.5
U3    0.6  0.1  0.3

π (initial): π_i = P(q_1 = U_i)

92

Observations and states

 O:     O1 O2 O3 O4 O5 O6 O7 O8
 OBS:   R  R  G  G  B  R  G  R
 State: S1 S2 S3 S4 S5 S6 S7 S8

 Si ∈ {U1, U2, U3}: a particular state
 S: state sequence;  O: observation sequence
 S* = "best" possible state (urn) sequence
 Goal: Maximize P(S*|O) by choosing the "best" S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

 S* = argmax_S P(S|O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

 P(S|O) = P(S1-8 | O1-8)
        = P(S1|O) · P(S2|S1, O) · P(S3|S2, S1, O) · … · P(S8|S7…S1, O)

 By Markov Assumption (a state depends only on the previous state):
 P(S|O) = P(S1|O) · P(S2|S1, O) · P(S3|S2, O) · … · P(S8|S7, O)

 O:     O1 O2 O3 O4 O5 O6 O7 O8
 OBS:   R  R  G  G  B  R  G  R
 State: S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

 P(A|B) = P(A) · P(B|A) / P(B)

 P(A): Prior
 P(B|A): Likelihood

 argmax_S P(S|O) = argmax_S P(S) · P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

 P(S) = P(S1-8)
      = P(S1) · P(S2|S1) · P(S3|S2,S1) · P(S4|S3,S2,S1) · … · P(S8|S7…S1)

 By Markov Assumption (k = 1):
 P(S) = P(S1) · P(S2|S1) · P(S3|S2) · P(S4|S3) · … · P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

 P(O|S) = P(O1|S1-8) · P(O2|O1, S1-8) · P(O3|O2, O1, S1-8) · … · P(O8|O1-7, S1-8)

 Assumption: the ball drawn depends only on the Urn chosen:
 P(O|S) = P(O1|S1) · P(O2|S2) · P(O3|S3) · … · P(O8|S8)

 So  P(S|O) ∝ P(S) · P(O|S)
           = P(S1) · P(S2|S1) · P(S3|S2) · P(S4|S3) · … · P(S8|S7)
             · P(O1|S1) · P(O2|S2) · P(O3|S3) · … · P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.

After S8 the next state is S9 with probability 1, i.e. P(S9|S8) = 1.
O0 is the ε-transition.

 O:     O0 O1 O2 O3 O4 O5 O6 O7 O8
 Obs:   ε  R  R  G  G  B  R  G  R
 State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

[Chain diagram: S0 → S1 → S2 → … → S8 → S9, with the outputs ε, R, R, G, G, B, R, G, R on the arcs]

 O:     O0 O1 O2 O3 O4 O5 O6 O7 O8
 Obs:   ε  R  R  G  G  B  R  G  R
 State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

 Notation: P(Ok|Sk) · P(Sk+1|Sk) = P(Sk → Sk+1) labelled with output Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

[Probabilistic FSM over states S1, S2; each arc is labelled (symbol, probability): S1→S1 (a1, 0.1), (a2, 0.2); S1→S2 (a1, 0.3), (a2, 0.4); S2→S1 (a1, 0.2), (a2, 0.3); S2→S2 (a1, 0.3), (a2, 0.2)]

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree

 Start:        S1: 1.0    S2: 0.0
 Reading a1:   S1→S1: 1.0 × 0.1 = 0.1    S1→S2: 1.0 × 0.3 = 0.3    (from S2: 0.0, 0.0)
 Reading a2:   S1→S1: 0.1 × 0.2 = 0.02   S1→S2: 0.1 × 0.4 = 0.04
               S2→S1: 0.3 × 0.3 = 0.09   S2→S2: 0.3 × 0.2 = 0.06

 Choose the winning sequence per state, per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd…

 Winners after ε a1 a2:   S1: 0.09    S2: 0.06
 Reading a1:  from S1: S1: 0.09 × 0.1 = 0.009,   S2: 0.09 × 0.3 = 0.027
              from S2: S1: 0.06 × 0.2 = 0.012,   S2: 0.06 × 0.3 = 0.018
 Reading a2:  from S2 (0.027): S1: 0.027 × 0.3 = 0.0081,  S2: 0.027 × 0.2 = 0.0054
              from S1 (0.012): S1: 0.012 × 0.2 = 0.0024,  S2: 0.012 × 0.4 = 0.0048

 The problem being addressed by this tree is S* = argmax_S P(S | a1-a2-a1-a2, µ),
 where a1-a2-a1-a2 is the output sequence and µ the model (or the machine).

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward):

 S1  S2  S1  S2  S1   (outputs a1, a2, a1, a2 along the path)

Problem statement: Find the best possible sequence S* = argmax_S P(S | O, µ)

where S: state sequence, O: output sequence, µ: machine or model;
µ = (S, S0, A, T): state collection, start symbol, alphabet set, transitions;
T is defined as P(Si --ak--> Sj)  ∀ i, j, k.

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

 Ending state \ Latest symbol observed:
        ε     a1                               a2              a1                a2
 S1     1.0   (1.0×0.1, 0.0×0.2) = (0.1, 0.0)   (0.02, 0.09)    (0.009, 0.012)    (0.0024, 0.0081)
 S2     0.0   (1.0×0.3, 0.0×0.3) = (0.3, 0.0)   (0.04, 0.06)    (0.027, 0.018)    (0.0048, 0.0054)

 Note: every cell records the winning probability ending in that state. The bold-faced value in each cell is the sequence probability ending in that state. Going backward from the final winner sequence (which ends in state S2, indicated by the 2nd tuple element) we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm (following James Allen, Natural Language Understanding (2nd edition), Benjamin Cummings (pub.), 1995)

Given:
 1. The HMM, which means:
    a. Start State: S1
    b. Alphabet: A = {a1, a2, …, ap}
    c. Set of States: S = {S1, S2, …, Sn}
    d. Transition probability P(Si --ak--> Sj), which is equal to P(Sj, ak | Si)
 2. The output string: a1 a2 … aT

To find: The most likely sequence of states C1 C2 … CT which produces the given output sequence, i.e.,
 C1 C2 … CT = argmax_C [ P(C | a1, a2, …, aT) ]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure:
 1. An N×T array called SEQSCORE to maintain the winner sequence always (N = #states, T = length of o/p sequence)
 2. Another N×T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
 1. Initialization
 2. Iteration
 3. Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

 SEQSCORE(1,1) = 1.0
 BACKPTR(1,1) = 0
 For i = 2 to N do
   SEQSCORE(i,1) = 0.0
 [expressing the fact that the first state is S1]

2. Iteration

 For t = 2 to T do
   For i = 1 to N do
     SEQSCORE(i,t) = Max over j = 1..N of [ SEQSCORE(j, t-1) × P(Sj --a_k--> Si) ]
     BACKPTR(i,t) = the index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

 C(T) = i that maximizes SEQSCORE(i,T)
 For i from (T-1) down to 1 do
   C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
 1. BACKPTR can be 1×T
 2. SEQSCORE can be T×2

Homework: compare this with A* / Beam Search. Reason for this comparison: both of them work for finding and recovering a sequence.
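A direct Python rendering of the three steps above (a sketch, not the course's reference implementation); SEQSCORE and BACKPTR are kept as full N×T arrays, i.e. without the optimizations just mentioned. The usage example encodes the two-state probabilistic FSM of the following slides, with arc probabilities reconstructed from the tree developed there.

def viterbi(output, states, start, trans):
    """states: list of state names, start: start state,
    trans[(Si, a, Sj)] = P(Si -a-> Sj). Returns the best state sequence C1..CT."""
    N, T = len(states), len(output)
    idx = {s: i for i, s in enumerate(states)}
    SEQSCORE = [[0.0] * (T + 1) for _ in range(N)]
    BACKPTR = [[0] * (T + 1) for _ in range(N)]

    # 1. Initialization: all probability mass on the start state.
    SEQSCORE[idx[start]][0] = 1.0

    # 2. Iteration: keep only the best sequence ending in each state.
    for t in range(1, T + 1):
        a = output[t - 1]
        for i, si in enumerate(states):
            best_j, best = 0, 0.0
            for j, sj in enumerate(states):
                score = SEQSCORE[j][t - 1] * trans.get((sj, a, si), 0.0)
                if score > best:
                    best_j, best = j, score
            SEQSCORE[i][t], BACKPTR[i][t] = best, best_j

    # 3. Sequence identification: walk BACKPTR from the best final state.
    c = max(range(N), key=lambda i: SEQSCORE[i][T])
    path = [c]
    for t in range(T, 1, -1):
        c = BACKPTR[c][t]
        path.append(c)
    return [states[i] for i in reversed(path)]

trans = {('S1', 'a1', 'S1'): 0.1, ('S1', 'a1', 'S2'): 0.3,
         ('S2', 'a1', 'S1'): 0.2, ('S2', 'a1', 'S2'): 0.3,
         ('S1', 'a2', 'S1'): 0.2, ('S1', 'a2', 'S2'): 0.4,
         ('S2', 'a2', 'S1'): 0.3, ('S2', 'a2', 'S2'): 0.2}
print(viterbi(['a1', 'a2', 'a1', 'a2'], ['S1', 'S2'], 'S1', trans))
# -> ['S2', 'S1', 'S2', 'S1']; with the start state S1 prepended, this is the
#    path S1 S2 S1 S2 S1 recovered on the later slides.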

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

[Trellis: S0 expands on ε to U1, U2, U3 (initial arc probabilities 0.5, 0.3, 0.2); on reading R each of these is expanded to U1, U2, U3 again, and per state only the highest-probability incoming sequence is retained]

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of order > 1 (say 2)

Same theory works:
 P(S)·P(O|S) = P(O0|S0)·P(S1|S0)
               · [P(O1|S1)·P(S2|S1,S0)] · [P(O2|S2)·P(S3|S2,S1)] · [P(O3|S3)·P(S4|S3,S2)]
               · [P(O4|S4)·P(S5|S4,S3)] · [P(O5|S5)·P(S6|S5,S4)] · [P(O6|S6)·P(S7|S6,S5)]
               · [P(O7|S7)·P(S8|S7,S6)] · [P(O8|S8)·P(S9|S8,S7)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e. P(S9|S8,S7) = 1.
O0 is the ε-transition.

 O:     O0 O1 O2 O3 O4 O5 O6 O7 O8
 Obs:   ε  R  R  G  G  B  R  G  R
 State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence? The language modeling problem.

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   • Less probable because of grammatical mistake.
3. P("The svn rises in the east")
   • Less probable because of lexical mistake.
4. P("The sun rises in the west")
   • Less probable because of semantic mistake.

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model:
1. Detect well-formedness
   • Lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   • Given a piece of text, what language does it belong to?
     Good morning  - English
     Guten Morgen  - German
     Bon jour      - French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0 o1 o2 o3 … om)?

 P(O) = Σ_S P(O, S)    [Marginalization]

Consider the observation sequence O = o0 o1 o2 … om together with state sequences S = S0 S1 S2 … Sm Sm+1, where the Si represent the states of a sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0 o1 o2 o3 … om)

 P(O, S) = P(S) · P(O|S)
         = P(S0 S1 … Sm+1) · P(o0 o1 … om | S0 S1 … Sm+1)
         = [ P(S0) · P(S1|S0) · P(S2|S1) · … · P(Sm+1|Sm) ] · [ P(o0|S0) · P(o1|S1) · … · P(om|Sm) ]
         = P(S0) · [P(o0|S0) P(S1|S0)] · [P(o1|S1) P(S2|S1)] · … · [P(om|Sm) P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(k,i) = probability of being in state Si having seen o0 o1 o2 … ok,
 i.e.  F(k,i) = P(o0 o1 o2 … ok, Si)

With m as the length of the observed sequence and N states:
 P(observed sequence) = P(o0 o1 o2 … om)
                      = Σ_{p=0..N} P(o0 o1 o2 … om, Sp)
                      = Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd.)

 F(k, q) = P(o0 o1 o2 … ok, Sq)
         = P(o0 o1 o2 … ok-1, ok, Sq)
         = Σ_{p=0..N} P(o0 o1 o2 … ok-1, Sp, ok, Sq)
         = Σ_{p=0..N} P(o0 o1 o2 … ok-1, Sp) · P(ok, Sq | o0 o1 o2 … ok-1, Sp)
         = Σ_{p=0..N} F(k-1, p) · P(ok, Sq | Sp)
         = Σ_{p=0..N} F(k-1, p) · P(Sp → Sq) with output ok

 O0 O1 O2 O3 … Ok Ok+1 … Om-1 Om
 S0 S1 S2 S3 … Sp  Sq  … Sm  Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(k,i)

 Define B(k,i) = probability of seeing ok ok+1 ok+2 … om given that the state was Si,
 i.e.  B(k,i) = P(ok ok+1 ok+2 … om | Si)

 With m as the length of the whole observed sequence:
 P(observed sequence) = P(o0 o1 o2 … om) = P(o0 o1 o2 … om | S0) = B(0,0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd.)

 B(k, p) = P(ok ok+1 ok+2 … om | Sp)
         = P(ok+1 ok+2 … om, ok | Sp)
         = Σ_{q=0..N} P(ok+1 ok+2 … om, ok, Sq | Sp)
         = Σ_{q=0..N} P(ok, Sq | Sp) · P(ok+1 ok+2 … om | ok, Sq, Sp)
         = Σ_{q=0..N} P(ok+1 ok+2 … om | Sq) · P(ok, Sq | Sp)
         = Σ_{q=0..N} B(k+1, q) · P(Sp → Sq) with output ok

 O0 O1 O2 O3 … Ok Ok+1 … Om-1 Om
 S0 S1 S2 S3 … Sp  Sq  … Sm  Sfinal
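A small sketch of this recurrence (following the slides' convention that the state at position k emits o_k and then transitions; the base case B(m+1, ·) = 1 is implicit):

def backward(obs, states, A, B_em):
    """Bk[p] = P(obs[k] ... obs[-1] | state at position k is p)."""
    Bk = {p: 1.0 for p in states}                   # base case beyond the last symbol
    for k in range(len(obs) - 1, -1, -1):
        Bk = {p: sum(B_em[p][obs[k]] * A[p][q] * Bk[q] for q in states)
              for p in states}
    return Bk

# With the urn tables A, B and observations defined in the earlier sketch,
# P(O) = sum(pi[p] * backward(observations, states, A, B)[p] for p in states),
# which is what B(0,0) computes once an epsilon-emitting start state S0 is added.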

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability: to find P(O) [the probability of the Observation Sequence].

E.g., ^ People laugh .

[Trellis: ^ followed by two columns of candidate states {N, V}, then .]

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

 Transition probabilities (row: previous tag, column: next tag):
        ^    N    V    .
 ^     0    0.7  0.3  0
 N     0    0.2  0.6  0.2
 V     0    0.6  0.2  0.2
 .     1    0    0    0

 Lexical probabilities (row: tag, column: word):
        ε    People  Laugh
 ^     1    0       0
 N     0    0.8     0.2
 V     0    0.1     0.9
 .     1    0       0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

 P(O) = Σ over all state sequences of  Π_i P(oi|Si) · P(Si → Si+1)

Computation in the various paths of the Tree (observations: ε, People, Laugh):

 Path 1: ^ N N .   P(Path1) = (1.0 × 0.7) × (0.8 × 0.2) × (0.2 × 0.2)
 Path 2: ^ N V .   P(Path2) = (1.0 × 0.7) × (0.8 × 0.6) × (0.9 × 0.2)
 Path 3: ^ V N .   P(Path3) = (1.0 × 0.3) × (0.1 × 0.6) × (0.2 × 0.2)
 Path 4: ^ V V .   P(Path4) = (1.0 × 0.3) × (0.1 × 0.2) × (0.9 × 0.2)

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis: F = accumulated F × output probability × transition probability

 F(N1) = 0.7 × 1.0
 F(V1) = 0.3 × 1.0
 F(N2) = F(N1) × (0.8 × 0.2) + F(V1) × (0.1 × 0.6)
 F(V2) = F(N1) × (0.8 × 0.6) + F(V1) × (0.1 × 0.2)
 F(.)  = F(N2) × (0.2 × 0.2) + F(V2) × (0.9 × 0.2)

[Trellis: ^ — {N1, V1} — {N2, V2} — . over the observations ε, People, Laugh]
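The same trellis in a few lines of Python (a sketch; the tables are transcribed from the previous slide, with the ^ state emitting ε with probability 1):

A = {'^': {'N': 0.7, 'V': 0.3},                      # P(next tag | previous tag)
     'N': {'N': 0.2, 'V': 0.6, '.': 0.2},
     'V': {'N': 0.6, 'V': 0.2, '.': 0.2}}
lex = {'^': {'ε': 1.0},                              # P(word | tag)
       'N': {'People': 0.8, 'Laugh': 0.2},
       'V': {'People': 0.1, 'Laugh': 0.9}}

F = {'^': 1.0}                                       # accumulated forward values
for word, next_tags in [('ε', ['N', 'V']), ('People', ['N', 'V']), ('Laugh', ['.'])]:
    F = {q: sum(F[p] * lex[p][word] * A[p][q] for p in F) for q in next_tags}

print(F['.'])   # P(^ People Laugh .) = 0.06676, the sum of the 4 path probabilities above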

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of Multiplications

Tree: Each path has 5 multiplications + 1 addition.
 There are 4 paths in the tree.
 Therefore, a total of 20 multiplications and 3 additions.

Trellis:
 F(N1) → 1 multiplication
 F(V1) → 1 multiplication
 F(N2) = F(N1) × (1 mult) + F(V1) × (1 mult) = 4 multiplications + 1 addition
 Similarly for F(V2) and F(.): 4 multiplications and 1 addition each.
 So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = #states and |O| = observation length (#words + ^ + .).

 Stage 1 of the Trellis: |S| multiplications.
 Stage 2 of the Trellis: |S| nodes; each node needs computation over |S| arcs.
   Each arc = 1 multiplication; accumulated F = 1 more multiplication.
   Total: 2|S|² multiplications.
 Same for each stage before reading '.'.
 At the final stage ('.'): 2|S| multiplications.
 Therefore, total multiplications = |S| + 2|S|²(|O| - 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

 1. Accumulate F over each stage of the trellis.
 2. Take the sum of F values multiplied by P(S_p → S_q).
 3. Complexity = |S| + 2|S|²(|O| - 1) + 2|S|
               = 2|S|²|O| - 2|S|² + 3|S|
               = O(|S|² |O|),
 i.e. linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward Probability
   a) Derive the Backward Algorithm.
   b) Compute its complexity.

2. Express P(O) in terms of both Forward and Backward probability.

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Morphology
 Word formation rules from root words.
 Nouns: Plural (boy-boys); Gender marking (czar-czarina)
 Verbs: Tense (stretch-stretched); Aspect (e.g. perfective: sit-had sat); Modality (e.g. request: khaanaa, khaaiie)
 Crucial first step in NLP.
 Languages rich in morphology: e.g. Dravidian, Hungarian, Turkish.
 Languages poor in morphology: Chinese, English.
 Languages with rich morphology have the advantage of easier processing at higher stages of processing.
 A task of interest to computer science: Finite State Machines for Word Morphology.

21 July 2014Pushpak Bhattacharyya Intro

POS 18

Lexical Analysis

Dictionary and word properties

dog:
  noun (lexical property)
  takes-'s'-in-plural (morph property)
  animate (semantic property)
  4-legged (-do-)
  carnivore (-do-)

21 July 2014Pushpak Bhattacharyya Intro

POS 19

Lexical Disambiguation: part of Speech Disambiguation
 Dog as a noun (animal)
 Dog as a verb (to pursue)

Sense Disambiguation
 Dog (as animal)
 Dog (as a very detestable person)
 The chair emphasised the need for adult education.

Very common in day to day communications:
 Satellite Channel Ad: Watch what you want, when you want (two senses of watch)
 Ground breaking ceremony/research
 (ToI 14/1/14) India eradicates polio, says WHO

21 July 2014Pushpak Bhattacharyya Intro

POS 20

Technological developments bring in new terms, and additional meanings/nuances for existing terms:

 Justify: as in justify the right margin (word-processing context)
 Xeroxed: a new verb
 Digital Trace: a new expression
 Communifaking: pretending to talk on a mobile when you are actually not
 Discomgooglation: anxiety/discomfort at not being able to access the internet
 Helicopter Parenting: over-parenting
 Obamagain, Obama care, modinomics

21 July 2014Pushpak Bhattacharyya Intro

POS 21

Ambiguity of Multiwords The grandfather kicked the bucket after suffering from cancer This job is a piece of cake Put the sweater on He is the dark horse of the match

Google Translations of above sentencesदादा कसर स पी ड़त होन क बाद बा ट लात मार इस काम क कक का एक टकड़ा हवटर पर रखोवह मच क अधर घोड़ा ह

21 July 2014Pushpak Bhattacharyya Intro

POS 22

Ambiguity of Named Entities

Bengali চ ল সরকার বািড়েত আেছEnglish Government is restless at home ()Chanchal Sarkar is at home

Amsterdam airport: "Baby Changing Room"

Hindi द नक दबग द नयाEnglish everyday bold worldActually name of a Hindi newspaper in Indore

High degree of overlap between NEs and MWEs

Treat differently - transliterate do not translate

21 July 2014Pushpak Bhattacharyya Intro

POS 23

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increased Complexity Of Processing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 24

Structure

[S [NP I] [VP [V like] [NP mangoes]]]   (parse tree for "I like mangoes")

21 July 2014Pushpak Bhattacharyya Intro

POS25

Structural Ambiguity

Scope:
 1. The old men and women were taken to safe locations.
    (old (men and women)) vs ((old men) and women)
 2. No smoking areas will allow Hookas inside.

Preposition Phrase Attachment:
 I saw the boy with a telescope. (who has the telescope?)
 I saw the mountain with a telescope. (world knowledge: a mountain cannot be an instrument of seeing)

A very ubiquitous newspaper headline: "20 years later, BMC pays father 20 lakhs for causing son's death"

21 July 2014Pushpak Bhattacharyya Intro

POS 26

Garden pathing

The only minus possibly was the need to face the audience more and more insightful question answer

The old man the boat

The horse raced past the garden fell

21 July 2014Pushpak Bhattacharyya Intro

POS 27

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increased Complexity Of Processing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 28

Semantic Analysis
 Representation in terms of:
  Predicate calculus / Semantic Nets / Frames / Conceptual Dependencies and Scripts

 John gave a book to Mary.
  Give: action;  Agent: John;  Object: Book;  Recipient: Mary

 Challenge: ambiguity in semantic role labeling
  (Eng) Visiting aunts can be a nuisance.
  (Hin) aapko mujhe mithaai khilaanii padegii
        (ambiguous in Marathi and Bengali too; not in Dravidian languages)

21 July 2014Pushpak Bhattacharyya Intro

POS 29

Coreference challenge

Binding of co-referring nouns and pronouns The monkey ate the banana because it

was hungry The monkey ate the banana because it

was ripe and sweet The monkey ate the banana because it

was lunch time

21 July 2014Pushpak Bhattacharyya Intro

POS 30

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increased Complexity Of Processing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 31

Pragmatics Very hard problem Model user intention

Tourist (in a hurry checking out of the hotel motioning to the service boy) Boy go upstairs and see if my sandals are under the divan Do not be late I just have 15 minutes to catch the train

Boy (running upstairs and coming back panting) yes sir they are there

World knowledge WHY INDIA NEEDS A SECOND OCTOBER (ToI

21007)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increased Complexity Of Processing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

Discourse: Processing of a sequence of sentences.
 Mother to John:
  "John, go to school. It is open today. Should you bunk? Father will be very angry."

 Ambiguity of open / bunk what?
 Why will the father be angry?
 Complex chain of reasoning and application of world knowledge.
 Ambiguity of father:
  father as parent, or
  father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected – today was the math test.

He couldnrsquot control the class

Teacher shouldnrsquot have made him responsible

After all he is just a janitor

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (12)1 Teacher (angrily) did you miss the class yesterday

Student not much

2 A man coming back to his parked car sees the sticker Parking fine He goes and thanks the policeman for appreciating his parking skill

3 John I got a Jaguar car for my unemployed youngest sonJack Thats a great exchange

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (22)A teacher-student exchangeTeacher What do you think is the capital of Ethiopia

Student What do you think

Teacher (angrily): I do not think <pause> I know

Student: I do not think I know <no pause>

21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps Segmentation (Word Boundary Detection) Identify the word

Isolated Word Recognition: Identify W given SS (speech signal).

 W* = argmax_W P(W | SS)

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood, called the "phonological model"; intuitively more tractable.

P(W) = prior probability, called the "language model".

 W* = argmax_W P(W | SS)
    = argmax_W P(W) · P(SS | W)

 P(W) = #(W appears in the corpus) / #(words in the corpus)
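Estimating this prior is a one-liner over a corpus; a tiny sketch (the corpus string is assumed for illustration):

from collections import Counter

def unigram_lm(tokens):
    counts, total = Counter(tokens), len(tokens)
    return lambda w: counts[w] / total            # P(W) = #W in corpus / #words in corpus

P = unigram_lm("the sun rises in the east".split())
print(P("the"), P("sun"))                         # 0.333..., 0.1666...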

21 July 2014Pushpak Bhattacharyya Intro

POS 39

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound → Text ambiguity: whether vs. weather, right vs. write, bought vs. bot

Text → Sound ambiguity: read (present tense) vs. read (past tense), lead (verb) vs. lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb), Advice (noun)
 ad + vise,  ad + vice

A syllable consists of:
 1. Nucleus
 2. Onset
 3. Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

 P(SS|W) = P(t o m ae t o | Word is "tomato") = product of arc probabilities

 Pronunciation Automaton for "tomato":
  s1 --t--> s2 --o--> s3 --m--> s4
  s4 --ae--> s5 (probability 0.73)  or  s4 --aa--> s6 (probability 0.27)
  s5/s6 --t--> s7 --o--> end   (all other arcs have probability 1.0)
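Reading P(SS|W) off such an automaton is a product over the traversed arcs; a sketch, encoding the "tomato" automaton above as a dictionary of arcs (the assignment of 0.73 to the 'ae' branch and 0.27 to the 'aa' branch is assumed from the figure):

# Pronunciation automaton for "tomato": arcs[(state, phoneme)] = (next state, probability).
arcs = {
    ('s1', 't'): ('s2', 1.0), ('s2', 'o'): ('s3', 1.0), ('s3', 'm'): ('s4', 1.0),
    ('s4', 'ae'): ('s5', 0.73), ('s4', 'aa'): ('s6', 0.27),
    ('s5', 't'): ('s7', 1.0), ('s6', 't'): ('s7', 1.0), ('s7', 'o'): ('end', 1.0),
}

def p_ss_given_w(phonemes, start='s1'):
    """P(SS|W) = product of arc probabilities along the phoneme sequence."""
    state, prob = start, 1.0
    for ph in phonemes:
        if (state, ph) not in arcs:
            return 0.0
        state, p = arcs[(state, ph)]
        prob *= p
    return prob

print(p_ss_given_w(['t', 'o', 'm', 'ae', 't', 'o']))   # 0.73
print(p_ss_given_w(['t', 'o', 'm', 'aa', 't', 'o']))   # 0.27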

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs Discriminative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

• Entity A and Entity B
  – Match(A, B)
  – Two entities match iff their parts match: Match(Parts(A), Parts(B))
  – Two entities match iff their properties match: Match(Properties(A), Properties(B))

• Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books, Journals, Proceedings

Main Text(s):
 Natural Language Understanding: James Allen
 Speech and NLP: Jurafsky and Martin
 Foundations of Statistical NLP: Manning and Schutze

Other References:
 Statistical NLP: Charniak

Journals:
 Computational Linguistics, Natural Language Engineering, AI, AI Magazine, IEEE SMC

Conferences:
 ACL, EACL, COLING, MT Summit, EMNLP, IJCNLP, HLT, ICON, SIGIR, WWW, ICML, ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied Disciplines
 Philosophy:                 Semantics, Meaning of "meaning", Logic (syllogism)
 Linguistics:                Study of Syntax, Lexicon, Lexical Semantics etc.
 Probability and Statistics: Corpus Linguistics, Testing of Hypotheses, System Evaluation
 Cognitive Science:          Computational Models of Language Processing, Language Acquisition
 Psychology:                 Behavioristic insights into Language Processing, Psychological Models
 Brain Science:              Language Processing Areas in Brain
 Physics:                    Information Theory, Entropy, Random Fields
 Computer Sc. & Engg.:       Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2: Shallow parsing
 Before break: Morph analysis and synthesis (segmentation, inflection, declension, derivation, etc.); Rule-based vs Statistical NLU comparison, with POS tagging as case study; Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summary
 • Both Linguistics and Computation are needed.
 • Linguistics is the eye, Computation the body.
 • Phenomenon → Formalization → Technique → Experimentation → Evaluation → Hypothesis Testing
 • has accorded to NLP the prestige it commands today.
 • Natural-Science-like approach.
 • Neither Theory Building nor Data-Driven Pattern finding can be ignored.

21 July 2014

Pushpak Bhattacharyya Intro POS 53

Lexical Analysis

Dictionary and word properties

dognoun (lexical property)take-rsquosrsquo-in-plural (morph property)animate (semantic property)4-legged (-do-)carnivore (-do)

21 July 2014Pushpak Bhattacharyya Intro

POS 19

Lexical Disambiguationpart of Speech Disambiguation

Dog as a noun (animal) Dog as a verb (to pursue)

Sense Disambiguation Dog (as animal) Dog (as a very detestable person) The chair emphasised the need for adult education

Very common in day to day communicationsSatellite Channel Ad Watch what you want when you

want (two senses of watch)Ground breaking ceremonyresearch(ToI 14114) India eradicates polio says WHO

21 July 2014Pushpak Bhattacharyya Intro

POS 20

Technological developments bring in new terms additional meaningsnuances for existing terms

Justify as in justify the right margin (word processing context)

Xeroxed a new verb Digital Trace a new expression Communifaking pretending to talk on

mobile when you are actually not Discomgooglation anxietydiscomfort at

not being able to access internet Helicopter Parenting over parenting Obamagain Obama care modinomics

21 July 2014Pushpak Bhattacharyya Intro

POS 21

Ambiguity of Multiwords The grandfather kicked the bucket after suffering from cancer This job is a piece of cake Put the sweater on He is the dark horse of the match

Google Translations of above sentencesदादा कसर स पी ड़त होन क बाद बा ट लात मार इस काम क कक का एक टकड़ा हवटर पर रखोवह मच क अधर घोड़ा ह

21 July 2014Pushpak Bhattacharyya Intro

POS 22

Ambiguity of Named Entities

Bengali চ ল সরকার বািড়েত আেছEnglish Government is restless at home ()Chanchal Sarkar is at home

Amsterdam airport ldquoBaby Changing Roomrdquo

Hindi द नक दबग द नयाEnglish everyday bold worldActually name of a Hindi newspaper in Indore

High degree of overlap between NEs and MWEs

Treat differently - transliterate do not translate

21 July 2014Pushpak Bhattacharyya Intro

POS 23

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 24

StructureS

NPVP

V NP

Ilike mangoes

21 July 2014Pushpak Bhattacharyya Intro

POS25

Structural Ambiguity Scope

1The old men and women were taken to safe locations(old men and women) vs ((old men) and women)2 No smoking areas will allow Hookas inside

Preposition Phrase Attachment I saw the boy with a telescope

(who has the telescope) I saw the mountain with a telescope

(world knowledge mountain cannot be an instrument of seeing)

Very ubiquitous newspaper headline ldquo20 years later BMC pays father 20 lakhs for causing sonrsquos deathrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 26

Garden pathing

The only minus possibly was the need to face the audience more and more insightful question answer

The old man the boat

The horse raced past the garden fell

21 July 2014Pushpak Bhattacharyya Intro

POS 27

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 28

Semantic Analysis Representation in terms of

Predicate calculusSemantic NetsFramesConceptual Dependencies and Scripts

John gave a book to Mary Give action Agent John Object Book Recipient

Mary Challenge ambiguity in semantic role labeling

(Eng) Visiting aunts can be a nuisance (Hin) aapko mujhe mithaai khilaanii padegii

(ambiguous in Marathi and Bengali too not in Dravidian languages)

21 July 2014Pushpak Bhattacharyya Intro

POS 29

Coreference challenge

Binding of co-referring nouns and pronouns The monkey ate the banana because it

was hungry The monkey ate the banana because it

was ripe and sweet The monkey ate the banana because it

was lunch time

21 July 2014Pushpak Bhattacharyya Intro

POS 30

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 31

Pragmatics Very hard problem Model user intention

Tourist (in a hurry checking out of the hotel motioning to the service boy) Boy go upstairs and see if my sandals are under the divan Do not be late I just have 15 minutes to catch the train

Boy (running upstairs and coming back panting) yes sir they are there

World knowledge WHY INDIA NEEDS A SECOND OCTOBER (ToI

21007)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

DiscourseProcessing of sequence of sentences Mother to John

John go to school It is open today Should you bunk Father will be very angry

Ambiguity of openbunk whatWhy will the father be angry

Complex chain of reasoning and application of world knowledge Ambiguity of father

father as parent or

father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected ndash today was the math test

He couldnrsquot control the class

Teacher shouldnrsquot have made him responsible

After all he is just a janitor

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (12)1 Teacher (angrily) did you miss the class yesterday

Student not much

2 A man coming back to his parked car sees the sticker Parking fine He goes and thanks the policeman for appreciating his parking skill

3 John I got a Jaguar car for my unemployed youngest sonJack Thats a great exchange

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (22)A teacher-student exchangeTeacher What do you think is the capital of Ethiopia

Student What do you think

Teacher (angrily) I do not think ltpausegt I know

Student I do not think I know ltno pausegt21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps Segmentation (Word Boundary Detection) Identify the word

Isolated Word Recognition Identify W given SS (speech signal)

Ŵ = argmax over W of P(W | SS)

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood, called the "phonological model"; intuitively more tractable

P(W) = prior probability, called the "language model"

Ŵ = argmax over W of P(W | SS)
  = argmax over W of P(W) P(SS | W)

P(W) = #(W appears in the corpus) / #(words in the corpus)
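A minimal sketch in Python (the language of the NLTK sessions later in the course) of this relative-frequency estimate of the prior P(W); the toy corpus is an illustrative assumption:

from collections import Counter

def unigram_prior(corpus_tokens):
    # P(W) = #(W appears in the corpus) / #(words in the corpus)
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens)
    return {w: c / total for w, c in counts.items()}

corpus = "the sun rises in the east the sun sets in the west".split()
P = unigram_prior(corpus)
print(P["the"], P["sun"])   # 4/12 and 2/12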

21 July 2014Pushpak Bhattacharyya Intro

POS 39

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound Text ambiguity whether vs weather right vs write bought vs bot

Text Sound ambiguity read (present tense) vs read (past tense) lead (verb) vs lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W) = P(t o m ae t o | Word is "tomato") = Product of arc probabilities

[Pronunciation automaton for "tomato": states s1–s7 plus end; arcs t → o → m → {ae (0.73) | aa (0.27)} → t → o → end, with all other arc probabilities 1.0]
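Since P(SS|W) is just the product of the arc probabilities along the path taken through the pronunciation automaton, a small sketch (the arc values are read off the tomato automaton above; the function name is ours):

def path_probability(arc_probs):
    # P(SS | W) = product of the probabilities on the arcs traversed
    p = 1.0
    for a in arc_probs:
        p *= a
    return p

# /t o m ae t o/ takes the 0.73 branch; every other arc in the figure has probability 1.0
print(path_probability([1.0, 1.0, 1.0, 0.73, 1.0, 1.0, 1.0]))   # 0.73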

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs Discrimnative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

bull Entity A and Entity BndashMatch(AB)

ndashTwo entities match iff their parts matchbull Match(Parts(A) Parts(B))

ndashTwo entities match iff their properties matchbull Match(Properties(A) Properties(B))

bull Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding James Allan Speech and NLP Jurafsky and Martin Foundations of Statistical NLP Manning and Schutze

Other References Statistical NLP Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied DisciplinesPhilosophy Semantics Meaning of ldquomeaningrdquo Logic

(syllogism)Linguistics Study of Syntax Lexicon Lexical Semantics etc

Probability and Statistics Corpus Linguistics Testing of Hypotheses System Evaluation

Cognitive Science Computational Models of Language Processing Language Acquisition

Psychology Behavioristic insights into Language Processing Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory Entropy Random Fields

Computer Sc amp Engg Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

infection declension derivation etc ) Rule based Vs Statistical NLU comparison with POS tagging as case study Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

"_" The_DT mechanisms_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, "_" said_VBD Mr._NNP Benton_NNP ._.
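Penn Treebank-style tags like these can be produced with NLTK (the toolkit used in the hands-on sessions of this course); a minimal usage sketch — the exact tags returned depend on the tagger model installed:

import nltk
nltk.download('punkt')                          # tokenizer model
nltk.download('averaged_perceptron_tagger')     # default English POS tagger
sentence = "The mechanisms that make traditional hardware are really being obsoleted by microprocessor-based machines."
print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# e.g. [('The', 'DT'), ('mechanisms', 'NNS'), ('that', 'WDT'), ('make', 'VBP'), ...]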

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_J batsman_N Aravinda_De_Silva_N is_F a_F man_N of_F few_J words_N was_F very_R much_R evident_J on_F Wednesday_N when_F the_F legendary_J batsman_N ,_F who_F has_V always_R let_V his_N bat_N talk_V ,_F struggled_V to_F answer_V a_F barrage_N of_F questions_N at_F a_F function_N to_F promote_V the_F cricket_N league_N in_F the_F city_N ._F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (1/2)

Best tag sequence = T* = argmax P(T|W) = argmax P(T)P(W|T)   (by Bayes' Theorem)

P(T) = P(t0=^, t1, t2, …, tn+1=.)
     = P(t0) P(t1|t0) P(t2|t1,t0) P(t3|t2,t1,t0) … P(tn|tn-1,tn-2,…,t0) P(tn+1|tn,tn-1,…,t0)
     = P(t0) P(t1|t0) P(t2|t1) … P(tn|tn-1) P(tn+1|tn)
     = ∏_{i=0}^{N+1} P(ti|ti-1)   (Bigram Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0–tn+1) P(w1|w0, t0–tn+1) P(w2|w1, w0, t0–tn+1) … P(wn|w0–wn-1, t0–tn+1) P(wn+1|w0–wn, t0–tn+1)

Assumption: A word is determined completely by its tag. This is inspired by speech recognition.

       = P(w0|t0) P(w1|t1) … P(wn+1|tn+1)
       = ∏_{i=0}^{n+1} P(wi|ti)
       = ∏_{i=1}^{n+1} P(wi|ti)   (Lexical Probability Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R ._.

[Lattice: each word position branches over candidate tags (N, V, A, …) from ^ to ., with lexical probabilities on the tag→word arcs and bigram probabilities on the tag→tag arcs]

This model is called the Generative model. Here words are observed from tags as states. This is similar to HMM.

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

[Bar chart: Per POS Accuracy for Bigram Assumption — per-tag accuracy (0 to 1) over BNC tags such as AJ0, AJC, AJS, AT0, AV0, AVP, AVQ, CJS, CJT, CRD, DT0, DTQ, ITJ, NN1, NN2, NP0, PNI, PNP, PNX, PRP, TO0, VBB, VBG, VBN, VDB, VDG, VDN, VHB, VHG, VHN, VM0, VVB, VVD, VVG, VVN, VVZ and their hyphenated ambiguity tags]

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

            AJ0  AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN  AJC  AJS   AT0   AV0  AV0-AJ0  AVP
AJ0        2899       20       32        1        3        3    0    0    18    35       27    1
AJ0-AV0      31       18        2        0        0        0    0    0     0     1       15    0
AJ0-NN1     161        0      116        0        0        0    0    0     0     0        1    0
AJ0-VVD       7        0        0        0        0        0    0    0     0     0        0    0
AJ0-VVG       8        0        0        0        2        0    0    0     1     0        0    0
AJ0-VVN       8        0        0        3        0        2    0    0     1     0        0    0
AJC           2        0        0        0        0        0   69    0     0    11        0    0
AJS           6        0        0        0        0        0    0   38     0     2        0    0
AT0         192        0        0        0        0        0    0    0  7000    13        0    0
AV0         120        8        2        0        0        0   15    2    24  2444       29   11
AV0-AJ0      10        7        0        0        0        0    0    0     0    16       33    0
AVP          24        0        0        0        0        0    0    0     1    11        0  737
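Per-POS accuracy is read off such a confusion matrix as the diagonal count over the row total; a small sketch using two rows of the matrix above (rows abbreviated to their non-zero cells, and assuming rows are the reference tag and columns the assigned tag):

# reference tag -> counts of assigned tags (fragment of the matrix above)
conf_matrix = {
    "AJ0": {"AJ0": 2899, "AJ0-AV0": 20, "AJ0-NN1": 32, "AT0": 18, "AV0": 35, "AV0-AJ0": 27},
    "AT0": {"AJ0": 192, "AT0": 7000, "AV0": 13},
}

def per_pos_accuracy(matrix):
    return {tag: row.get(tag, 0) / sum(row.values()) for tag, row in matrix.items()}

print(per_pos_accuracy(conf_matrix))   # AT0 comes out around 7000/7205 ≈ 0.97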

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1: # of Red = 30, # of Green = 50, # of Blue = 20

Urn 2: # of Red = 10, # of Green = 40, # of Blue = 50

Urn 3: # of Red = 60, # of Green = 10, # of Blue = 30

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

Transition probability table:
      U1   U2   U3
U1   0.1  0.4  0.5
U2   0.6  0.2  0.2
U3   0.3  0.4  0.3

Emission probability table:
      R    G    B
U1   0.3  0.5  0.2
U2   0.1  0.4  0.5
U3   0.6  0.1  0.3

Given the observation sequence R R G G B R G R, find the state sequence — not so easily computable.

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

[State diagram: urns U1, U2, U3 drawn as states, with the transition probabilities of the table above on the arcs and the emission probabilities for R, G, B attached to each state]

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

[State diagram: the same three urns, with each arc now labelled by the combined probability P(transition) × P(colour) for R, G and B (e.g. R: 0.24, G: 0.04, B: 0.12 on one arc)]

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

[Partial sequence graph: ^ → N/V for "people", then N/V for "laugh", then .]

Transition probability table (row = previous tag, column = next tag):
      ^    N    V    O    .
^     0   0.6  0.2  0.2   0
N     0   0.1  0.4  0.3  0.2
V     0   0.3  0.1  0.3  0.3
O     0   0.3  0.2  0.3  0.2
.     1    0    0    0    0

This transition table will change from language to language due to language divergences.

Lexical Probability Table (size = #POS tags in tagset × vocabulary size, where vocabulary size = #unique words in the corpus):
      Є    people    laugh     …
^     1    0         0
N     0    1×10⁻³    1×10⁻⁵
V     0    1×10⁻⁶    1×10⁻³
O     0    0         1×10⁻⁹
.     1    0         0

Inference on a new sentence: ^ people laugh .
p( ^ N N . | ^ people laugh . ) = (0.6 × 0.1) × (0.1 × 1 × 10⁻³) × (0.2 × 1 × 10⁻⁵)

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = #states (tags + ^ + .) and |o| = length of the sentence (#words + ^ + .), then #sequences = |s|^(|o|−2)

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming

[Tree over Є, people, laugh, expanding N, V, O at each level]
^ → N1: 0.6 × 1.0 = 0.6;  ^ → V2: 0.2;  ^ → O3: 0.2
Expanding N1:
  0.6 × 0.1 × 10⁻³ = 6 × 10⁻⁵
  (1) 0.6 × 0.4 × 10⁻³ = 2.4 × 10⁻⁴
  (2) 0.6 × 0.3 × 10⁻³ = 1.8 × 10⁻⁴
  (3) 0.6 × 0.2 × 10⁻³ = 1.2 × 10⁻⁴

No need to expand N4 and N5 because they will never be a part of the winning sequence.

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s|^|o| to |s|·|o| expansions

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm

[Tree for the sentence "^ People laugh .", expanding N, V, O at each level: after Ԑ the per-state scores are (0.6), (0.2), (0.2); after People, (0.06 × 10⁻³), (0.24 × 10⁻³), (0.18 × 10⁻³); after laugh, (0.06 × 10⁻⁶), (0.02 × 10⁻⁶), (0.06 × 10⁻⁶); then (0), (0), (0)]

Claim: We do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B) = P(B|A) if P(A) = P(B)

P(A|B) can mean:
Causality: B causes A
Sequentiality: A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = {U1, U2, U3}, V = {R, G, B}

For observation O = o1 … on and state sequence Q = q1 … qn,

π is the initial state distribution, πi = P(q1 = Ui)

A = transition probability matrix:
      U1   U2   U3
U1   0.1  0.4  0.5
U2   0.6  0.2  0.2
U3   0.3  0.4  0.3

B = emission probability matrix:
      R    G    B
U1   0.3  0.5  0.2
U2   0.1  0.4  0.5
U3   0.6  0.1  0.3
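The same HMM written out as plain Python structures, useful for the assignments; A and B are copied from the tables above, while π is left symbolic on the slide, so the uniform start distribution here is an assumption:

states  = ["U1", "U2", "U3"]
symbols = ["R", "G", "B"]

A = {"U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},    # transition probabilities
     "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
     "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3}}

B = {"U1": {"R": 0.3, "G": 0.5, "B": 0.2},       # emission probabilities
     "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
     "U3": {"R": 0.6, "G": 0.1, "B": 0.3}}

pi = {u: 1.0 / len(states) for u in states}      # assumed uniform, since the slide leaves pi symbolic

# probability of one fully specified run: states U1, U3 emitting R, G
p = pi["U1"] * B["U1"]["R"] * A["U1"]["U3"] * B["U3"]["G"]
print(p)   # (1/3) * 0.3 * 0.5 * 0.1 = 0.005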

92

Observations and states

Obs:    R  R  G  G  B  R  G  R   (O1 … O8)
State:  S1 S2 S3 S4 S5 S6 S7 S8

Si ∈ {U1, U2, U3}: a particular state
S: state sequence; O: observation sequence
S* = "best" possible state (urn) sequence
Goal: Maximize P(S|O) by choosing the "best" S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

S* = argmax_S P(S | O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

P(S1–8 | O1–8) = P(S1|O) P(S2|S1, O) P(S3|S2, S1, O) … P(S8|S7 … S1, O)

By Markov Assumption (a state depends only on the previous state):

P(S1–8 | O1–8) = P(S1|O) P(S2|S1, O) P(S3|S2, O) … P(S8|S7, O)

Obs:    R  R  G  G  B  R  G  R   (O1 … O8)
State:  S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) = P(A) P(B|A) / P(B)

P(A): Prior
P(B|A): Likelihood

argmax_S P(S|O) = argmax_S P(S) P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

P(S) = P(S1–8) = P(S1) P(S2|S1) P(S3|S2, S1) P(S4|S3, S2, S1) … P(S8|S7 … S1)

By Markov Assumption (k=1):

P(S) = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) … P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

P(O|S) = P(O1|S1–8) P(O2|O1, S1–8) P(O3|O2, O1, S1–8) … P(O8|O1–7, S1–8)

Assumption: the ball drawn depends only on the Urn chosen

P(O|S) = P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)

P(S|O) ∝ P(S) P(O|S)
       = P(S1) P(S2|S1) P(S3|S2) … P(S8|S7) · P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk) P(Sk+1|Sk) = P(Sk → Sk+1) with output Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is S* = argmax_S P(S | a1-a2-a1-a2, μ),
where a1-a2-a1-a2 is the output sequence and μ the model (or the machine).

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement: Find the best possible sequence S* = argmax_S P(S | O, μ),
where S = state sequence, O = output sequence, and μ = machine or model.

Machine or Model = (S, S0, A, T): state collection, start symbol, alphabet set, transitions.

T is defined as P(Si --ak--> Sj), the probability of moving from Si to Sj emitting ak.

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

Ending state \ latest symbol observed:
        Є      a1                                a2              a1                a2
S1      1.0    (1.0×0.1, 0.0×0.2) = (0.1, 0.0)   (0.02, 0.09)    (0.009, 0.012)    (0.0024, 0.0081)
S2      0.0    (1.0×0.3, 0.0×0.3) = (0.3, 0.0)   (0.04, 0.06)    (0.027, 0.018)    (0.0048, 0.0054)

Note: every cell records the winning probability ending in that state; in the original slide the bold-faced value in each cell is the sequence probability ending in that state. Going backward from the final winner sequence, which ends in state S2 (indicated by the 2nd tuple), we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm (following James Allen, Natural Language Understanding (2nd edition), Benjamin Cummings (pub), 1995)

Given:
1. The HMM, which means:
   a. Start State: S1
   b. Alphabet: A = {a1, a2, …, ap}
   c. Set of States: S = {S1, S2, …, Sn}
   d. Transition probability P(Si --ak--> Sj), which is equal to P(Sj | ak, Si)
2. The output string: a1 a2 … aT

To find: the most likely sequence of states C1 C2 … CT which produces the given output sequence, i.e.
C1 C2 … CT = argmax_C [ P(C | a1 a2 … aT) ]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structures:
1. An N×T array called SEQSCORE to maintain the winner sequence always (N = #states, T = length of o/p sequence)
2. Another N×T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For (i = 2 to N) do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For (t = 2 to T) do
    For (i = 1 to N) do
        SEQSCORE(i,t) = Max over j = 1..N of [ SEQSCORE(j, (t-1)) × P(Sj --a_k--> Si) ]
        BACKPTR(i,t) = the index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = i that maximizes SEQSCORE(i,T)
For i from (T-1) down to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1×T
2. SEQSCORE can be T×2

Homework: compare this with A*/Beam Search. Reason for this comparison: both of them work for finding and recovering sequences.
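A minimal Python sketch of the SEQSCORE/BACKPTR procedure above, specialised to the POS trellis; the toy transition and lexical numbers are the illustrative values used earlier in these slides, not trained estimates:

def viterbi(words, tags, trans, lex, start="^"):
    # trans[(t_prev, t)] : P(t | t_prev);  lex[(word, t)] : P(word | t)
    backptr = [{start: None}]
    prev = {start: 1.0}                      # column 0 holds only the start state
    for w in words:
        col, back = {}, {}
        for t in tags:
            best_p, best_j = 0.0, None
            for j, pj in prev.items():
                p = pj * trans.get((j, t), 0.0) * lex.get((w, t), 0.0)
                if p > best_p:
                    best_p, best_j = p, j
            col[t], back[t] = best_p, best_j
        backptr.append(back)
        prev = col
    # pick the best final state and follow the back-pointers
    t = max(prev, key=prev.get)
    best_score = prev[t]
    path = [t]
    for k in range(len(words), 1, -1):
        t = backptr[k][t]
        path.append(t)
    return list(reversed(path)), best_score

trans = {("^", "N"): 0.6, ("^", "V"): 0.2, ("N", "N"): 0.1, ("N", "V"): 0.4,
         ("V", "N"): 0.3, ("V", "V"): 0.1}
lex = {("people", "N"): 1e-3, ("people", "V"): 1e-6,
       ("laugh", "N"): 1e-5, ("laugh", "V"): 1e-3}
print(viterbi(["people", "laugh"], ["N", "V"], trans, lex))
# (['N', 'V'], 2.4e-07): "people" tagged as noun, "laugh" as verb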

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

[Trellis for the first two symbols ε and R: from S0 the urns receive scores 0.5, 0.3, 0.2 on the ε step; after observing R the retained per-urn scores are 0.03, 0.08, 0.15, and expanding these again gives candidates such as 0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036, of which only the per-state winners are kept]

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0, o1, o2, o3, …, om)?

P(O) = Σ_S P(O, S)   (Marginalization)

Consider the observation sequence O = o0 o1 o2 … om with state sequence S = S0 S1 S2 S3 … Sm Sm+1,
where the Si's represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0, o1, o2, o3, …, om)

P(O, S) = P(S) P(O|S)

P(S0 S1 … Sm+1) P(o0 o1 … om | S0 S1 … Sm+1)
  = P(S0) P(S1|S0) P(S2|S1) … P(Sm+1|Sm) · P(o0|S0) P(o1|S1) … P(om|Sm)
  = P(S0) [P(o0|S0) P(S1|S0)] [P(o1|S1) P(S2|S1)] … [P(om|Sm) P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k, i)

Define F(k, i) = probability of being in state Si having seen o0 o1 o2 … ok
F(k, i) = P(o0 o1 o2 … ok, Si)
With m as the length of the observed sequence and N states,
P(observed sequence) = P(o0 o1 o2 … om) = Σ_{p=0..N} P(o0 o1 o2 … om, Sp) = Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd.)

F(k, q) = P(o0 o1 o2 … ok, Sq)
        = P(o0 o1 o2 … ok-1, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 … ok-1, Sp, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 … ok-1, Sp) · P(ok, Sq | o0 o1 o2 … ok-1, Sp)
        = Σ_{p=0..N} F(k-1, p) · P(ok, Sq | Sp)
        = Σ_{p=0..N} F(k-1, p) · P(Sp → Sq) with output ok

O0 O1 O2 O3 … Ok Ok+1 … Om-1 Om
S0 S1 S2 S3 … Sp Sq   … Sm   Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(k, i)

Define B(k, i) = probability of seeing ok ok+1 ok+2 … om given that the state was Si
B(k, i) = P(ok ok+1 ok+2 … om | Si)
With m as the length of the whole observed sequence,
P(observed sequence) = P(o0 o1 o2 … om) = P(o0 o1 o2 … om | S0) = B(0, 0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd.)

B(k, p) = P(ok ok+1 ok+2 … om | Sp)
        = P(ok+1 ok+2 … om, ok | Sp)
        = Σ_{q=0..N} P(ok+1 ok+2 … om, ok, Sq | Sp)
        = Σ_{q=0..N} P(ok, Sq | Sp) · P(ok+1 ok+2 … om | ok, Sq, Sp)
        = Σ_{q=0..N} P(ok+1 ok+2 … om | Sq) · P(ok, Sq | Sp)
        = Σ_{q=0..N} B(k+1, q) · P(Sp → Sq) with output ok

O0 O1 O2 O3 … Ok Ok+1 … Om-1 Om
S0 S1 S2 S3 … Sp Sq   … Sm   Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probabilities:
      ^    N    V    .
^     0   0.7  0.3   0
N     0   0.2  0.6  0.2
V     0   0.6  0.2  0.2
.     1    0    0    0

Lexical probabilities:
      ε    People  Laugh
^     1    0       0
N     0    0.8     0.2
V     0    0.1     0.9
.     1    0       0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all state sequences of ∏ P(Oi | Si) · P(Si → Sj)   (summing the probability of every path through the tree)

Computation in various paths of the Tree (over ε, People, Laugh)

Path 1: ^ N N .
P(Path1) = (1.0 × 0.7) × (0.8 × 0.2) × (0.2 × 0.2)

Path 2: ^ N V .
P(Path2) = (1.0 × 0.7) × (0.8 × 0.6) × (0.9 × 0.2)

Path 3: ^ V N .
P(Path3) = (1.0 × 0.3) × (0.1 × 0.6) × (0.2 × 0.2)

Path 4: ^ V V .
P(Path4) = (1.0 × 0.3) × (0.1 × 0.2) × (0.9 × 0.2)

[Tree: from ^, branch to N and V on "People", then again to N and V on "Laugh", then close with .]

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis

F = accumulated F × output probability × transition probability

F1 = 0.7 × 1.0                                   (reach N)
F2 = 0.3 × 1.0                                   (reach V)
F3 = F1 × (0.2 × 0.8) + F2 × (0.6 × 0.1)         (N at the next column)
F4 = F1 × (0.6 × 0.8) + F2 × (0.2 × 0.1)         (V at the next column)
F5 = F3 × (0.2 × 0.2) + F4 × (0.2 × 0.9)         (reach .)

[Trellis over ε, People, Laugh with states ^, N, N, V, V, .]

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of Multiplications

Tree: each path has 5 multiplications + 1 addition; there are 4 paths in the tree. Therefore a total of 20 multiplications and 3 additions.

Trellis: F1 → 1 multiplication; F2 → 1 multiplication; F3 = F1 × (1 mult) + F2 × (1 mult) = 4 multiplications + 1 addition. Similarly for F4 and F5: 4 multiplications and 1 addition each. So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = #states and |O| = observation length, not counting ^ and .

Stage 1 of the Trellis: |S| multiplications.
Stage 2 of the Trellis: |S| nodes; each node needs computation over |S| arcs; each arc = 1 multiplication; accumulated F = 1 more multiplication. Total: 2|S|² multiplications.
The same holds for each stage before reading '.'.
At the final stage ('.'): 2|S| multiplications.

Therefore, total multiplications = |S| + 2|S|²(|O| − 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of F values multiplied by P(Si → Sj).
3. Complexity = |S| + 2|S|²(|O| − 1) + 2|S|
              = 2|S|²|O| − 2|S|² + 3|S|
              = O(|S|²|O|)

i.e. linear in the length of the input and quadratic in the number of states.

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Lexical Disambiguationpart of Speech Disambiguation

Dog as a noun (animal) Dog as a verb (to pursue)

Sense Disambiguation Dog (as animal) Dog (as a very detestable person) The chair emphasised the need for adult education

Very common in day to day communicationsSatellite Channel Ad Watch what you want when you

want (two senses of watch)Ground breaking ceremonyresearch(ToI 14114) India eradicates polio says WHO

21 July 2014Pushpak Bhattacharyya Intro

POS 20

Technological developments bring in new terms additional meaningsnuances for existing terms

Justify as in justify the right margin (word processing context)

Xeroxed a new verb Digital Trace a new expression Communifaking pretending to talk on

mobile when you are actually not Discomgooglation anxietydiscomfort at

not being able to access internet Helicopter Parenting over parenting Obamagain Obama care modinomics

21 July 2014Pushpak Bhattacharyya Intro

POS 21

Ambiguity of Multiwords The grandfather kicked the bucket after suffering from cancer This job is a piece of cake Put the sweater on He is the dark horse of the match

Google Translations of above sentencesदादा कसर स पी ड़त होन क बाद बा ट लात मार इस काम क कक का एक टकड़ा हवटर पर रखोवह मच क अधर घोड़ा ह

21 July 2014Pushpak Bhattacharyya Intro

POS 22

Ambiguity of Named Entities

Bengali চ ল সরকার বািড়েত আেছEnglish Government is restless at home ()Chanchal Sarkar is at home

Amsterdam airport ldquoBaby Changing Roomrdquo

Hindi द नक दबग द नयाEnglish everyday bold worldActually name of a Hindi newspaper in Indore

High degree of overlap between NEs and MWEs

Treat differently - transliterate do not translate

21 July 2014Pushpak Bhattacharyya Intro

POS 23

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 24

StructureS

NPVP

V NP

Ilike mangoes

21 July 2014Pushpak Bhattacharyya Intro

POS25

Structural Ambiguity Scope

1The old men and women were taken to safe locations(old men and women) vs ((old men) and women)2 No smoking areas will allow Hookas inside

Preposition Phrase Attachment I saw the boy with a telescope

(who has the telescope) I saw the mountain with a telescope

(world knowledge mountain cannot be an instrument of seeing)

Very ubiquitous newspaper headline ldquo20 years later BMC pays father 20 lakhs for causing sonrsquos deathrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 26

Garden pathing

The only minus possibly was the need to face the audience more and more insightful question answer

The old man the boat

The horse raced past the garden fell

21 July 2014Pushpak Bhattacharyya Intro

POS 27

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 28

Semantic Analysis Representation in terms of

Predicate calculusSemantic NetsFramesConceptual Dependencies and Scripts

John gave a book to Mary Give action Agent John Object Book Recipient

Mary Challenge ambiguity in semantic role labeling

(Eng) Visiting aunts can be a nuisance (Hin) aapko mujhe mithaai khilaanii padegii

(ambiguous in Marathi and Bengali too not in Dravidian languages)

21 July 2014Pushpak Bhattacharyya Intro

POS 29

Coreference challenge

Binding of co-referring nouns and pronouns The monkey ate the banana because it

was hungry The monkey ate the banana because it

was ripe and sweet The monkey ate the banana because it

was lunch time

21 July 2014Pushpak Bhattacharyya Intro

POS 30

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 31

Pragmatics Very hard problem Model user intention

Tourist (in a hurry checking out of the hotel motioning to the service boy) Boy go upstairs and see if my sandals are under the divan Do not be late I just have 15 minutes to catch the train

Boy (running upstairs and coming back panting) yes sir they are there

World knowledge WHY INDIA NEEDS A SECOND OCTOBER (ToI

21007)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

DiscourseProcessing of sequence of sentences Mother to John

John go to school It is open today Should you bunk Father will be very angry

Ambiguity of openbunk whatWhy will the father be angry

Complex chain of reasoning and application of world knowledge Ambiguity of father

father as parent or

father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected ndash today was the math test

He couldnrsquot control the class

Teacher shouldnrsquot have made him responsible

After all he is just a janitor

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (12)1 Teacher (angrily) did you miss the class yesterday

Student not much

2 A man coming back to his parked car sees the sticker Parking fine He goes and thanks the policeman for appreciating his parking skill

3 John I got a Jaguar car for my unemployed youngest sonJack Thats a great exchange

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (22)A teacher-student exchangeTeacher What do you think is the capital of Ethiopia

Student What do you think

Teacher (angrily) I do not think ltpausegt I know

Student I do not think I know ltno pausegt21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps Segmentation (Word Boundary Detection) Identify the word

Isolated Word Recognition Identify W given SS (speech signal)

^arg max ( | )

WW P W SS

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood called ldquophonological model ldquo intuitively more tractable

P(W) = prior probability called ldquolanguage modelrdquo

^arg m ax ( | )

arg m ax ( ) ( | )W

W

W P W SS

P W P SS W

W appears in the corpus( ) words in the corpus

P W

21 July 2014Pushpak Bhattacharyya Intro

POS 39

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound Text ambiguity whether vs weather right vs write bought vs bot

Text Sound ambiguity read (present tense) vs read (past tense) lead (verb) vs lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W)= P(t o m ae t o |Word is ldquotomatordquo) = Product of arc probabilities

t

s4

o m o

ae

t

aa

end

s1 s2 s3s5

s6 s7

10 10 10 1010

10

073

027

Word

Pronunciation Automaton

Tomato

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs Discrimnative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

bull Entity A and Entity BndashMatch(AB)

ndashTwo entities match iff their parts matchbull Match(Parts(A) Parts(B))

ndashTwo entities match iff their properties matchbull Match(Properties(A) Properties(B))

bull Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding James Allan Speech and NLP Jurafsky and Martin Foundations of Statistical NLP Manning and Schutze

Other References Statistical NLP Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied DisciplinesPhilosophy Semantics Meaning of ldquomeaningrdquo Logic

(syllogism)Linguistics Study of Syntax Lexicon Lexical Semantics etc

Probability and Statistics Corpus Linguistics Testing of Hypotheses System Evaluation

Cognitive Science Computational Models of Language Processing Language Acquisition

Psychology Behavioristic insights into Language Processing Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory Entropy Random Fields

Computer Sc amp Engg Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

infection declension derivation etc ) Rule based Vs Statistical NLU comparison with POS tagging as case study Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

ldquo_ldquo The_DT mechanisms_NNS that_WDTmake_VBP traditional_JJ hardware_NNare_VBP really_RB being_VBGobsoleted_VBN by_IN microprocessor-based_JJ machines_NNS _ rdquo_rdquo said_VBDMr_NNP Benton_NNP _

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

Latest symbol observed →
            ε     a1                              a2               a1                a2
S1 (end)    1.0   (1.0·0.1, 0.0·0.2) = (0.1, 0.0)  (0.02, 0.09)     (0.009, 0.012)    (0.0024, 0.0081)
S2 (end)    0.0   (1.0·0.3, 0.0·0.3) = (0.3, 0.0)  (0.04, 0.06)     (0.027, 0.018)    (0.0048, 0.0054)

Rows: ending state.  Note: every cell records the candidate probabilities of reaching that state (from S1 and from S2); the winning value is the best sequence probability ending in that state.

Final winner: the bold-faced value in each cell is the best sequence probability ending in that state. Going backward from the final winner (whose winning value is the 2nd element of the tuple, i.e. it was reached via S2), we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105
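The winning values in the table above can be reproduced with a few lines of Python. This is an illustrative sketch, not part of the original slides; the combined transition-plus-emission probabilities arcs[(from_state, symbol, to_state)] are the values read off the probabilistic FSM figure (slide 101) and are therefore assumptions.

# Assumed arc probabilities P(s --symbol--> s') from the two-state PFSM.
arcs = {
    ('S1', 'a1', 'S1'): 0.1, ('S1', 'a1', 'S2'): 0.3,
    ('S1', 'a2', 'S1'): 0.2, ('S1', 'a2', 'S2'): 0.4,
    ('S2', 'a1', 'S1'): 0.2, ('S2', 'a1', 'S2'): 0.3,
    ('S2', 'a2', 'S1'): 0.3, ('S2', 'a2', 'S2'): 0.2,
}
score = {'S1': 1.0, 'S2': 0.0}           # start: S1 with probability 1
for sym in ['a1', 'a2', 'a1', 'a2']:     # the observed output sequence
    score = {t: max(score[s] * arcs[(s, sym, t)] for s in score) for t in score}
    print(sym, score)
# Prints 0.1/0.3, 0.09/0.06, 0.012/0.027, 0.0081/0.0054 -- the winning cell values.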

Algorithm (following James Allen, Natural Language Understanding (2nd edition), Benjamin/Cummings (pub.), 1995)

Given:
1. The HMM, which means:
   a. Start state S1
   b. Alphabet A = {a1, a2, …, ap}
   c. Set of states S = {S1, S2, …, Sn}
   d. Transition probability P(Si --ak--> Sj), which is equal to P(Sj, ak | Si)
2. The output string a1 a2 … aT

To find: the most likely sequence of states C1 C2 … CT which produces the given output sequence, i.e.,

C1 C2 … CT = argmax_C [ P(C | a1 a2 … aT) ]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contd…

Data Structure:
1. An N×T array called SEQSCORE to maintain the winner sequence always (N = #states, T = length of output sequence)
2. Another N×T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For i = 2 to N do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For t = 2 to T do
    For i = 1 to N do
        SEQSCORE(i,t) = Max over j = 1..N of [ SEQSCORE(j, (t-1)) · P(Sj --ak--> Si) ]
        BACKPTR(i,t) = the index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = the i that maximizes SEQSCORE(i,T)
For i from (T-1) down to 1 do
    C(i) = BACKPTR[ C(i+1), (i+1) ]

Optimizations possible:
1. BACKPTR can be 1×T
2. SEQSCORE can be T×2

Homework: compare this with A* / Beam Search. Reason for this comparison: both of them work for finding and recovering a sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 109
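Before moving to the urn example, the three steps above can be rendered as a short Python routine. This is a sketch, not the slides' own code: it uses 0-based indexing instead of the 1-based arrays of the pseudocode, and it merges transition and emission into one table trans[(s_from, symbol, s_to)], which plays the role of P(Sj --ak--> Si).

def viterbi(output, states, start, trans):
    """Most likely state sequence for `output`, given combined
    transition-cum-emission probabilities trans[(s_from, symbol, s_to)]."""
    T = len(output)
    # Initialization: probability 1 in the start state, 0 elsewhere.
    seqscore = [{s: (1.0 if s == start else 0.0) for s in states}]
    backptr = [dict()]
    # Iteration: extend the best-scoring path into each state.
    for t in range(T):
        scores, back = {}, {}
        for s in states:
            best_prev = max(states,
                            key=lambda p: seqscore[t][p] * trans.get((p, output[t], s), 0.0))
            scores[s] = seqscore[t][best_prev] * trans.get((best_prev, output[t], s), 0.0)
            back[s] = best_prev
        seqscore.append(scores)
        backptr.append(back)
    # Sequence identification: walk the back-pointers from the best final state.
    last = max(states, key=lambda s: seqscore[T][s])
    path = [last]
    for t in range(T, 0, -1):
        path.append(backptr[t][path[-1]])
    return list(reversed(path)), seqscore[T][last]

# Example use, with `arcs` as in the earlier PFSM sketch:
#   viterbi(['a1', 'a2', 'a1', 'a2'], ['S1', 'S2'], 'S1', arcs)
# recovers the path S1 S2 S1 S2 S1 found above.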

Viterbi Algorithm for the Urn problem (first two symbols)

[Figure: the first two levels of the Viterbi tree for the observation ε R …. From S0, ε-arcs lead to U1, U2, U3 (arc probabilities 0.5, 0.3, 0.2 in the figure); on reading R each urn again expands to U1, U2, U3. The figure records the accumulated scores at each node (0.03, 0.08, 0.15, …) and at the leaves (0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036).]

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   • Less probable because of grammatical mistake
3. P("The svn rises in the east")
   • Less probable because of lexical mistake
4. P("The sun rises in the west")
   • Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model:
1. Detect well-formedness
   • Lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   • Given a piece of text, what language does it belong to?
     Good morning → English; Guten Morgen → German; Bonjour → French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0 o1 o2 o3 … om)?

P(O) = Σ_S P(O, S)    (Marginalization)

Consider the observation sequence O = o0 o1 o2 … om, with underlying state sequence S0 S1 S2 S3 … Sm Sm+1, where the Si's represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0 o1 o2 o3 … om)

P(O, S) = P(S) · P(O|S)

P(S0 S1 … Sm+1, o0 o1 … om)
    = P(S0) · P(S1|S0) · P(S2|S1) · … · P(Sm+1|Sm) · P(o0 o1 … om | S0 S1 … Sm+1)
    = P(S0) · P(S1|S0) · P(S2|S1) · … · P(Sm+1|Sm) · P(o0|S0) · P(o1|S1) · … · P(om|Sm)
    = [P(o0|S0) · P(S1|S0)] · [P(o1|S1) · P(S2|S1)] · … · [P(om|Sm) · P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k,i)

Define F(k,i) = probability of being in state Si having seen o0 o1 o2 … ok
F(k,i) = P(o0 o1 o2 … ok, Si)
With m as the length of the observed sequence and N states,
P(observed sequence) = P(o0 o1 o2 … om) = Σ_{p=0..N} P(o0 o1 o2 … om, Sp) = Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd.)

F(k, q) = P(o0 o1 o2 … ok, Sq)
        = P(o0 o1 o2 … ok-1, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 … ok-1, Sp, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 … ok-1, Sp) · P(ok, Sq | o0 o1 o2 … ok-1, Sp)
        = Σ_{p=0..N} F(k-1, p) · P(ok, Sq | Sp)
        = Σ_{p=0..N} F(k-1, p) · P(Sp → Sq)_ok

O0 O1 O2 O3 … Ok Ok+1 … Om-1 Om
S0 S1 S2 S3 … Sp Sq   … Sm   Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119
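For reference, the recurrence above can be written out as a short routine. This is an illustrative sketch, not the slides' own code; trans[p][q] and emit[p][o] are assumed dictionaries for P(Sq|Sp) and P(o|Sp), and, following slide 100, the symbol on an arc is taken to be emitted by the source state, so P(Sp → Sq)_o = emit[p][o] * trans[p][q].

def forward(obs, states, trans, emit, start):
    """Return F as a list of columns: F[k][q] = P(o_0 ... o_k, in state q)."""
    # First column: the arc out of the start state emits o_0.
    F = [{q: emit[start].get(obs[0], 0.0) * trans[start].get(q, 0.0) for q in states}]
    # F(k, q) = sum_p F(k-1, p) * emit[p][o_k] * trans[p][q]
    for o in obs[1:]:
        prev = F[-1]
        F.append({q: sum(prev[p] * emit[p].get(o, 0.0) * trans[p].get(q, 0.0)
                         for p in states) for q in states})
    return F
# P(O) is then the sum of the last column over the admissible final states.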

Backward probability B(k,i)

Define B(k,i) = probability of seeing ok ok+1 ok+2 … om given that the state was Si
B(k,i) = P(ok ok+1 ok+2 … om | Si)
With m as the length of the whole observed sequence,
P(observed sequence) = P(o0 o1 o2 … om) = P(o0 o1 o2 … om | S0) = B(0, 0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd.)

B(k, p) = P(ok ok+1 ok+2 … om | Sp)
        = P(ok+1 ok+2 … om, ok | Sp)
        = Σ_{q=0..N} P(ok+1 ok+2 … om, ok, Sq | Sp)
        = Σ_{q=0..N} P(ok, Sq | Sp) · P(ok+1 ok+2 … om | ok, Sq, Sp)
        = Σ_{q=0..N} P(ok+1 ok+2 … om | Sq) · P(ok, Sq | Sp)
        = Σ_{q=0..N} B(k+1, q) · P(Sp → Sq)_ok

O0 O1 O2 O3 … Ok Ok+1 … Om-1 Om
S0 S1 S2 S3 … Sp Sq   … Sm   Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121
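The mirror-image recurrence can be sketched the same way (again an illustration under the same arc convention as the forward sketch, not the slides' own code): B(k, p) = emit[p][o_k] * Σ_q trans[p][q] * B(k+1, q), and B(0, start) recovers the same P(O) that the forward pass computes.

def backward(obs, states, trans, emit, final):
    """Return B as a list of columns: B[k][p] = P(o_k ... o_m | in state p)."""
    # Base case: after the last symbol is emitted we must be in the final state.
    B = [{q: (1.0 if q == final else 0.0) for q in states}]
    for o in reversed(obs):
        nxt = B[0]
        B.insert(0, {p: emit[p].get(o, 0.0) *
                        sum(trans[p].get(q, 0.0) * nxt[q] for q in states)
                     for p in states})
    return B
# Sanity check: B[0][start] should equal the forward-pass P(O).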

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probabilities:
      ^     N     V     .
^     0     0.7   0.3   0
N     0     0.2   0.6   0.2
V     0     0.6   0.2   0.2
.     1     0     0     0

Lexical (emission) probabilities:
      ε     People   Laugh
^     1     0        0
N     0     0.8      0.2
V     0     0.1      0.9
.     1     0        0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all state sequences S of Π_i [ P(Oi | Si) · P(Si → Si+1) ]

Computation in various paths of the tree (ε, People, Laugh):

Path 1: ^ N N
P(Path1) = (1.0 × 0.7) × (0.8 × 0.2) × (0.2 × 0.2)

Path 2: ^ N V
P(Path2) = (1.0 × 0.7) × (0.8 × 0.6) × (0.9 × 0.2)

Path 3: ^ V N
P(Path3) = (1.0 × 0.3) × (0.1 × 0.6) × (0.2 × 0.2)

Path 4: ^ V V
P(Path4) = (1.0 × 0.3) × (0.1 × 0.2) × (0.9 × 0.2)

[Figure: the four paths drawn as a tree over ^ → {N, V} → {N, V} → ., with arc labels ε, People, Laugh.]

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis

F = accumulated F × output probability × transition probability

F(N1) = 0.7 × 1.0
F(V1) = 0.3 × 1.0
F(N2) = F(N1) × (0.2 × 0.8) + F(V1) × (0.6 × 0.1)
F(V2) = F(N1) × (0.6 × 0.8) + F(V1) × (0.2 × 0.1)
F(.)  = F(N2) × (0.2 × 0.2) + F(V2) × (0.2 × 0.9)

[Trellis: ^ --ε--> {N1, V1} --People--> {N2, V2} --Laugh--> . ]

21 July 2014Pushpak Bhattacharyya Intro

POS 125
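A quick numeric check of these recurrences (an illustrative sketch; the probability tables are the transition/lexical tables shown two slides earlier, encoded as Python dictionaries):

# Transition and lexical tables for the "^ People Laugh ." example.
trans = {'^': {'N': 0.7, 'V': 0.3},
         'N': {'N': 0.2, 'V': 0.6, '.': 0.2},
         'V': {'N': 0.6, 'V': 0.2, '.': 0.2}}
emit  = {'^': {'ε': 1.0},
         'N': {'People': 0.8, 'Laugh': 0.2},
         'V': {'People': 0.1, 'Laugh': 0.9}}

F_N1 = trans['^']['N'] * emit['^']['ε']            # 0.7
F_V1 = trans['^']['V'] * emit['^']['ε']            # 0.3
F_N2 = F_N1 * emit['N']['People'] * trans['N']['N'] + F_V1 * emit['V']['People'] * trans['V']['N']
F_V2 = F_N1 * emit['N']['People'] * trans['N']['V'] + F_V1 * emit['V']['People'] * trans['V']['V']
F_end = F_N2 * emit['N']['Laugh'] * trans['N']['.'] + F_V2 * emit['V']['Laugh'] * trans['V']['.']

# The same P(O) by brute force over the four tree paths of the previous slide.
paths = [0.7 * 0.16 * 0.04, 0.7 * 0.48 * 0.18, 0.3 * 0.06 * 0.04, 0.3 * 0.02 * 0.18]
assert abs(F_end - sum(paths)) < 1e-12             # both give P(O), about 0.06676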

Number of Multiplications

Tree: each path has 5 multiplications; there are 4 paths in the tree. Therefore a total of 20 multiplications and 3 additions.

Trellis: F(N1) → 1 multiplication; F(V1) → 1 multiplication; F(N2) = F(N1) × (1 mult) + F(V1) × (1 mult) = 4 multiplications + 1 addition. Similarly for F(V2) and F(.): 4 multiplications and 1 addition each.

So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = #states and |O| = observation length (counting '^' and '.').
Stage 1 of the trellis: |S| multiplications.
Stage 2 of the trellis: |S| nodes; each node needs computation over |S| arcs. Each arc = 1 multiplication; accumulating into F = 1 more multiplication. Total: 2|S|² multiplications.
The same holds for each stage before reading '.'. At the final stage ('.'): 2|S| multiplications.

Therefore total multiplications = |S| + 2|S|²(|O| − 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of F values multiplied by P(Sp → Sq).
3. Complexity = |S| + 2|S|²(|O| − 1) + 2|S|
             = 2|S|²|O| − 2|S|² + 3|S|
             = O(|S|² |O|),
   i.e. linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Technological developments bring in new terms, and additional meanings/nuances for existing terms:

Justify: as in "justify the right margin" (word processing context)
Xeroxed: a new verb
Digital Trace: a new expression
Communifaking: pretending to talk on a mobile when you are actually not
Discomgooglation: anxiety/discomfort at not being able to access the internet
Helicopter Parenting: over-parenting
Obamagain, Obamacare, Modinomics

21 July 2014Pushpak Bhattacharyya Intro

POS 21

Ambiguity of Multiwords The grandfather kicked the bucket after suffering from cancer This job is a piece of cake Put the sweater on He is the dark horse of the match

Google Translations of above sentencesदादा कसर स पी ड़त होन क बाद बा ट लात मार इस काम क कक का एक टकड़ा हवटर पर रखोवह मच क अधर घोड़ा ह

21 July 2014Pushpak Bhattacharyya Intro

POS 22

Ambiguity of Named Entities

Bengali চ ল সরকার বািড়েত আেছEnglish Government is restless at home ()Chanchal Sarkar is at home

Amsterdam airport ldquoBaby Changing Roomrdquo

Hindi द नक दबग द नयाEnglish everyday bold worldActually name of a Hindi newspaper in Indore

High degree of overlap between NEs and MWEs

Treat differently - transliterate do not translate

21 July 2014Pushpak Bhattacharyya Intro

POS 23

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increased Complexity Of Processing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 24

Structure

[Parse tree: (S (NP I) (VP (V like) (NP mangoes)))]

21 July 2014Pushpak Bhattacharyya Intro

POS25

Structural Ambiguity Scope

1. The old men and women were taken to safe locations.
   (old (men and women)) vs ((old men) and women)
2. No smoking areas will allow Hookas inside.

Preposition Phrase Attachment I saw the boy with a telescope

(who has the telescope) I saw the mountain with a telescope

(world knowledge mountain cannot be an instrument of seeing)

Very ubiquitous newspaper headline ldquo20 years later BMC pays father 20 lakhs for causing sonrsquos deathrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 26

Garden pathing

The only minus possibly was the need to face the audience more and more insightful question answer

The old man the boat

The horse raced past the garden fell

21 July 2014Pushpak Bhattacharyya Intro

POS 27

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increased Complexity Of Processing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 28

Semantic Analysis Representation in terms of

Predicate calculusSemantic NetsFramesConceptual Dependencies and Scripts

John gave a book to Mary Give action Agent John Object Book Recipient

Mary Challenge ambiguity in semantic role labeling

(Eng) Visiting aunts can be a nuisance (Hin) aapko mujhe mithaai khilaanii padegii

(ambiguous in Marathi and Bengali too not in Dravidian languages)

21 July 2014Pushpak Bhattacharyya Intro

POS 29

Coreference challenge

Binding of co-referring nouns and pronouns The monkey ate the banana because it

was hungry The monkey ate the banana because it

was ripe and sweet The monkey ate the banana because it

was lunch time

21 July 2014Pushpak Bhattacharyya Intro

POS 30

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increased Complexity Of Processing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 31

Pragmatics Very hard problem Model user intention

Tourist (in a hurry checking out of the hotel motioning to the service boy) Boy go upstairs and see if my sandals are under the divan Do not be late I just have 15 minutes to catch the train

Boy (running upstairs and coming back panting) yes sir they are there

World knowledge WHY INDIA NEEDS A SECOND OCTOBER (ToI

21007)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increased Complexity Of Processing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

DiscourseProcessing of sequence of sentences Mother to John

John go to school It is open today Should you bunk Father will be very angry

Ambiguity of openbunk whatWhy will the father be angry

Complex chain of reasoning and application of world knowledge Ambiguity of father

father as parent or

father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected ndash today was the math test

He couldnrsquot control the class

Teacher shouldnrsquot have made him responsible

After all he is just a janitor

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (12)1 Teacher (angrily) did you miss the class yesterday

Student not much

2 A man coming back to his parked car sees the sticker Parking fine He goes and thanks the policeman for appreciating his parking skill

3 John I got a Jaguar car for my unemployed youngest sonJack Thats a great exchange

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (22)A teacher-student exchangeTeacher What do you think is the capital of Ethiopia

Student What do you think

Teacher (angrily) I do not think ltpausegt I know

Student I do not think I know ltno pausegt21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps Segmentation (Word Boundary Detection) Identify the word

Isolated Word Recognition Identify W given SS (speech signal)

Ŵ = argmax_W P(W | SS)

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood called ldquophonological model ldquo intuitively more tractable

P(W) = prior probability called ldquolanguage modelrdquo

Ŵ = argmax_W P(W | SS) = argmax_W P(W) · P(SS | W)

P(W) = #(W appears in the corpus) / #(words in the corpus)

21 July 2014Pushpak Bhattacharyya Intro

POS 39
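A toy illustration of this argmax in Python follows. It is a sketch only: the candidate words, unigram counts and acoustic scores P(SS|W) are all made-up placeholder numbers, not values from the slides or from any real recognizer.

corpus_counts = {'weather': 120, 'whether': 80, 'feather': 5}   # hypothetical unigram counts
total = sum(corpus_counts.values())
acoustic = {'weather': 0.30, 'whether': 0.28, 'feather': 0.01}  # placeholder P(SS|W) scores

def recognize(candidates):
    # W* = argmax_W P(W) * P(SS|W)
    return max(candidates, key=lambda w: (corpus_counts[w] / total) * acoustic[w])

print(recognize(['weather', 'whether', 'feather']))   # 'weather' for these numbers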

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound Text ambiguity whether vs weather right vs write bought vs bot

Text Sound ambiguity read (present tense) vs read (past tense) lead (verb) vs lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS | W) = P(t o m ae t o | Word is "tomato") = product of arc probabilities

[Figure: pronunciation automaton for the word "Tomato": s1 --t--> s2 --o--> s3 --m--> s4, then a branch --ae--> (probability 0.73) or --aa--> (probability 0.27), followed by --t--> and --o--> into the end state; all other arcs carry probability 1.0.]

Tomato

21 July 2014Pushpak Bhattacharyya Intro

POS 44
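The "product of arc probabilities" can be made concrete with a few lines of Python. This is a sketch; the state names and the placement of the 0.73 / 0.27 branch (the ae / aa choice) are read off the automaton figure and are assumptions.

arcs = {('s1', 't'): 1.0, ('s2', 'o'): 1.0, ('s3', 'm'): 1.0,
        ('s4', 'ae'): 0.73, ('s4', 'aa'): 0.27,
        ('s5', 't'): 1.0, ('s6', 'o'): 1.0}

def path_probability(path):            # path = [(state, phoneme), ...]
    prob = 1.0
    for arc in path:
        prob *= arcs[arc]
    return prob

p = path_probability([('s1', 't'), ('s2', 'o'), ('s3', 'm'), ('s4', 'ae'), ('s5', 't'), ('s6', 'o')])
print(p)                               # 0.73 for the /t o m ae t o/ pronunciation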

Foundational question

Generative vs Discriminative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

• Entity A and Entity B – Match(A, B)
  – Two entities match iff their parts match: Match(Parts(A), Parts(B))
  – Two entities match iff their properties match: Match(Properties(A), Properties(B))
• Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books, Journals, Proceedings

Main Text(s):
Natural Language Understanding – James Allen
Speech and Language Processing – Jurafsky and Martin
Foundations of Statistical NLP – Manning and Schutze

Other References:
Statistical NLP – Charniak

Journals: Computational Linguistics, Natural Language Engineering, AI, AI Magazine, IEEE SMC

Conferences: ACL, EACL, COLING, MT Summit, EMNLP, IJCNLP, HLT, ICON, SIGIR, WWW, ICML, ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied Disciplines

Philosophy: Semantics, meaning of "meaning", Logic (syllogism)
Linguistics: Study of Syntax, Lexicon, Lexical Semantics etc.
Probability and Statistics: Corpus Linguistics, Testing of Hypotheses, System Evaluation
Cognitive Science: Computational Models of Language Processing, Language Acquisition
Psychology: Behavioristic insights into Language Processing, Psychological Models
Brain Science: Language Processing Areas in Brain
Physics: Information Theory, Entropy, Random Fields
Computer Sc. & Engg.: Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2: Shallow parsing
Before break: Morph analysis and synthesis (segmentation, inflection, declension, derivation etc.); Rule-based vs Statistical NLU comparison with POS tagging as case study; Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summary

• Both Linguistics and Computation are needed
• Linguistics is the eye, Computation the body
• Phenomenon → Formalization → Technique → Experimentation → Evaluation → Hypothesis Testing
  — this has accorded to NLP the prestige it commands today
• Natural-Science-like approach
• Neither Theory Building nor Data-Driven Pattern Finding can be ignored
21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

"_" The_DT mechanisms_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, "_" said_VBD Mr._NNP Benton_NNP ._.

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increased Complexity Of Processing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation:
That_F/N/J ('that' can be a complementizer (can be put under 'F'), demonstrative (can be put under 'J') or pronoun (can be put under 'N'))
former_J Sri_N/J Lanka_N/J ('Sri Lanka' together qualify the skipper)
skipper_N/V ('skipper' can be a verb too)
and_F ace_J/N ('ace' can be both J and N: "Nadal served an ace")
batsman_N/J ('batsman' can be J as it qualifies Aravinda De Silva)
Aravinda_N De_N Silva_N is_F a_F man_N/V ('man' can be a verb too, as in 'man the boat')
of_F few_J words_N/V ('words' can be a verb too, as in 'he words his speeches beautifully')

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation:
was_F very_R much_R evident_J on_F Wednesday_N
when_F/N ('when' can be a relative pronoun (put under 'N') as in 'I know the time when he comes')
the_F legendary_J batsman_N who_F/N has_V always_R let_V his_N bat_N/V talk_V/N struggle_V/N answer_V/N barrage_N/V question_N/V function_N/V promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (1/2)

Best tag sequence = T* = argmax P(T|W) = argmax P(T) · P(W|T)   (by Bayes' Theorem)

P(T) = P(t0=^, t1, t2, …, tn+1=.)
     = P(t0) · P(t1|t0) · P(t2|t1,t0) · P(t3|t2,t1,t0) · … · P(tn|tn-1,…,t0) · P(tn+1|tn,…,t0)
     = P(t0) · P(t1|t0) · P(t2|t1) · … · P(tn|tn-1) · P(tn+1|tn)
     = Π_{i=0}^{n+1} P(ti|ti-1)    (Bigram Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (2/2)

P(W|T) = P(w0|t0..tn+1) · P(w1|w0, t0..tn+1) · P(w2|w1 w0, t0..tn+1) · … · P(wn|w0..wn-1, t0..tn+1) · P(wn+1|w0..wn, t0..tn+1)

Assumption: a word is determined completely by its tag. This is inspired by speech recognition.

       = P(w0|t0) · P(w1|t1) · … · P(wn+1|tn+1)
       = Π_{i=0}^{n+1} P(wi|ti)    (Lexical Probability Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R ._.

[Figure: the tag sequence ^ → N → V → R → . drawn as a chain of states with bigram probabilities on the arcs, and each word hanging off its tag with a lexical probability.]

This model is called the Generative model. Here words are observed from tags as states. This is similar to HMM.

21 July 2014Pushpak Bhattacharyya Intro

POS 67
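The two assumptions can be combined into a one-line scoring function. This is a minimal sketch, not the slides' code; the bigram and lexical probability values below are hypothetical, chosen only to show the shape of the computation for the sentence "^ People_N Jump_V High_R ._." above.

# Hypothetical bigram (transition) and lexical probabilities.
trans = {('^', 'N'): 0.6, ('N', 'V'): 0.4, ('V', 'R'): 0.2, ('R', '.'): 0.3}
lex   = {('People', 'N'): 1e-3, ('Jump', 'V'): 2e-3, ('High', 'R'): 5e-3, ('.', '.'): 1.0}

def score(words, tags):
    """P(T) * P(W|T) under the bigram and lexical-probability assumptions."""
    p, prev = 1.0, '^'
    for w, t in zip(words, tags):
        p *= trans.get((prev, t), 0.0) * lex.get((w, t), 0.0)
        prev = t
    return p

print(score(['People', 'Jump', 'High', '.'], ['N', 'V', 'R', '.']))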

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68
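The last two evaluation steps (per-POS accuracy and the confusion matrix) can be computed with a few lines of Python. A sketch, assuming gold and predicted tags are available as two equal-length lists; the example tags are hypothetical.

from collections import Counter, defaultdict

def per_pos_accuracy_and_confusion(gold, predicted):
    confusion = defaultdict(Counter)          # confusion[gold_tag][predicted_tag] -> count
    for g, p in zip(gold, predicted):
        confusion[g][p] += 1
    accuracy = {g: row[g] / sum(row.values()) for g, row in confusion.items()}
    return accuracy, confusion

gold      = ['N', 'N', 'V', 'N', 'R', 'V']    # hypothetical gold tags
predicted = ['N', 'V', 'V', 'N', 'R', 'N']    # hypothetical tagger output
acc, conf = per_pos_accuracy_and_confusion(gold, predicted)
print(acc['N'])                               # 2 of the 3 gold nouns tagged correctly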

[Bar chart: per-POS accuracy under the bigram assumption; one bar per POS tag / ambiguity class (AJ0, AJ0-NN1, AJ0-VVG, AJC, AT0, AV0-AJ0, AVP-PRP, AVQ-CJS, CJS, CJS-PRP, CJT-DT0, CRD-PNI, DT0, DTQ, ITJ, NN1, NN1-NP0, NN1-VVG, NN2-VVZ, NP0-NN1, PNI, PNP, PNX, PRP, PRP-CJS, TO0, VBB, VBG, VBN, VDB, VDG, VDN, VHB, VHG, VHN, VM0, VVB-NN1, VVD-AJ0, VVG, VVG-NN1, VVN, VVN-VVD, VVZ-NN2); y-axis: accuracy from 0 to 1.2.]

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

          AJ0   AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN  AJC  AJS  AT0   AV0   AV0-AJ0  AVP
AJ0       2899  20       32       1        3        3        0    0    18    35    27       1
AJ0-AV0   31    18       2        0        0        0        0    0    0     1     15       0
AJ0-NN1   161   0        116      0        0        0        0    0    0     0     1        0
AJ0-VVD   7     0        0        0        0        0        0    0    0     0     0        0
AJ0-VVG   8     0        0        0        2        0        0    0    1     0     0        0
AJ0-VVN   8     0        0        3        0        2        0    0    1     0     0        0
AJC       2     0        0        0        0        0        69   0    0     11    0        0
AJS       6     0        0        0        0        0        0    38   0     2     0        0
AT0       192   0        0        0        0        0        0    0    7000  13    0        0
AV0       120   8        2        0        0        0        15   2    24    2444  29       11
AV0-AJ0   10    7        0        0        0        0        0    0    0     16    33       0
AVP       24    0        0        0        0        0        0    0    1     11    0        737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1: # of Red = 30, # of Green = 50, # of Blue = 20
Urn 2: # of Red = 10, # of Green = 40, # of Blue = 50
Urn 3: # of Red = 60, # of Green = 10, # of Blue = 30

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

Given:

Transition probability table (A):
      U1   U2   U3
U1    0.1  0.4  0.5
U2    0.6  0.2  0.2
U3    0.3  0.4  0.3

Emission probability table (B):
      R    G    B
U1    0.3  0.5  0.2
U2    0.1  0.4  0.5
U3    0.6  0.1  0.3

Observation: RRGGBRGR

State sequence: not so easily computable.

21 July 2014Pushpak Bhattacharyya Intro

POS 73
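For later reference, the two tables can be held as Python dictionaries; a minimal sketch (the values are exactly those of the tables above):

A = {'U1': {'U1': 0.1, 'U2': 0.4, 'U3': 0.5},     # transition probabilities
     'U2': {'U1': 0.6, 'U2': 0.2, 'U3': 0.2},
     'U3': {'U1': 0.3, 'U2': 0.4, 'U3': 0.3}}
B = {'U1': {'R': 0.3, 'G': 0.5, 'B': 0.2},        # emission probabilities
     'U2': {'R': 0.1, 'G': 0.4, 'B': 0.5},
     'U3': {'R': 0.6, 'G': 0.1, 'B': 0.3}}
# Every row sums to 1, as a stochastic matrix must.
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in list(A.values()) + list(B.values()))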

Diagrammatic representation (1/2)

[Figure: the three urns U1, U2, U3 drawn as states. The transition arcs carry the probabilities of table A (e.g. U1→U1 0.1, U1→U2 0.4, U1→U3 0.5, …), and each state lists its emission probabilities from table B (e.g. U1: R 0.3, G 0.5, B 0.2).]

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (2/2)

[Figure: the same three-state diagram with emission and transition folded together; each arc Ui → Uj is labelled with P(Uj|Ui) · P(colour|Ui) for each colour, e.g. R: 0.24, G: 0.04, B: 0.12 on the U3 → U2 arc.]

POS 75

Classic problems with respect to HMM:
1. Given the observation sequence, find the possible state sequences — Viterbi
2. Given the observation sequence, find its probability — forward/backward algorithm
3. Given the observation sequence, find the HMM parameters — Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi

The "start" and "end" are important in a sequence. Subtrees get eliminated due to the Markov Assumption.

POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), . (end) [start & end states]

Lexicon:
people: N, V
laugh: N, V

Corpora for Training:
^ w11_t11 w12_t12 w13_t13 …… w1k_1_t1k_1 .
^ w21_t21 w22_t22 w23_t23 …… w2k_2_t2k_2 .
^ wn1_tn1 wn2_tn2 wn3_tn3 …… wnk_n_tnk_n .

Inference

Transition probability table:
      ^    N    V    O    .
^     0    0.6  0.2  0.2  0
N     0    0.1  0.4  0.3  0.2
V     0    0.3  0.1  0.3  0.3
O     0    0.3  0.2  0.3  0.2
.     1    0    0    0    0

This transition table will change from language to language due to language divergences

Partial sequence graph: ^ → {N, V} → {N, V} → .

Lexical Probability Table (size of this table = #POS tags in tagset × vocabulary size; vocabulary size = #unique words in the corpus):
      ε    people     laugh      …
^     1    0          0          0
N     0    1×10^-3    1×10^-5
V     0    1×10^-6    1×10^-3
O     0    0          1×10^-9
.     1    0          0          0

Inference: New Sentence

^ people laugh .

p( ^ N N . | ^ people laugh . ) = (0.6 × 1) × (0.1 × 1×10^-3) × (0.2 × 1×10^-5)
(each bracket is one step's transition probability × lexical probability)

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |S| = #states (tags + ^ + .) and |O| = length of the sentence (#words + ^ + .), then #sequences = |S|^(|O|-2).

But a large number of partial computations can be reused using Dynamic Programming
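The exponential computation being warned about can be written down directly, as a brute-force sketch (not the slides' code) over all tag sequences for "^ people laugh .", using the transition and lexical tables of the inference example above:

from itertools import product

trans = {'^': {'N': 0.6, 'V': 0.2, 'O': 0.2},
         'N': {'N': 0.1, 'V': 0.4, 'O': 0.3, '.': 0.2},
         'V': {'N': 0.3, 'V': 0.1, 'O': 0.3, '.': 0.3},
         'O': {'N': 0.3, 'V': 0.2, 'O': 0.3, '.': 0.2}}
lex = {'N': {'people': 1e-3, 'laugh': 1e-5},
       'V': {'people': 1e-6, 'laugh': 1e-3},
       'O': {'people': 0.0,  'laugh': 1e-9}}

def seq_prob(tags, words=('people', 'laugh')):
    p, prev = 1.0, '^'
    for t, w in zip(tags, words):
        p *= trans[prev].get(t, 0.0) * lex[t].get(w, 0.0)
        prev = t
    return p * trans[prev].get('.', 0.0)      # close the sequence with '.'

best = max(product('NVO', repeat=2), key=seq_prob)   # enumerates all |S|^2 sequences
print(best, seq_prob(best))                           # ('N', 'V') wins for these tables

Dynamic programming (next slide) avoids re-scoring shared prefixes of these sequences.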

Dynamic Programming

[Figure: the Viterbi tree for "^ people laugh .". From ^ (with ε), the children N, V, O get scores 0.6 × 1.0 = 0.6, 0.2 × 1.0 = 0.2 and 0.2 × 1.0 = 0.2. Expanding the N node on "people": its children get 0.6 × 0.1 × 10^-3 = 6 × 10^-5 (to N), 0.6 × 0.4 × 10^-3 = 2.4 × 10^-4 (to V), 0.6 × 0.3 × 10^-3 = 1.8 × 10^-4 (to O) and 0.6 × 0.2 × 10^-3 = 1.2 × 10^-4 (to .); the V and O nodes expand similarly.]

No need to expand N4 and N5 because they will never be a part of the winning sequence.

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.

Now complexity reduces from |S|^|O| to |S| × |O|.

Here we followed the Markov assumption of order 1.

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm

[Figure: the Viterbi tree for the sentence "^ People laugh .". From ^ (reading ε), the children N, V, O receive scores (0.6), (0.2), (0.2). Expanding on "People": the children of N get (0.06 × 10^-3), (0.24 × 10^-3), (0.18 × 10^-3); the children of V get (0.06 × 10^-6), (0.02 × 10^-6), (0.06 × 10^-6); the children of O all get (0).]

Claim: we do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass

Will a word always be given the same tag? No. Consider the example:
^ people the city with soldiers . (i.e. 'populate')
^ quickly people the city .
In the first sentence "people" is most likely to be tagged as a noun, whereas in the second the probability mass will shift and "people" will be tagged as a verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon

Long tail phenomenon: probability is very low but not zero over a large observed sequence.

Language phenomenon: "people", which is predominantly tagged as "Noun", displays a long tail phenomenon; "laugh" is predominantly tagged as "Verb".

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

[Figure: two nodes N1 (score 6 × 10^-5) and N2 (score 6 × 10^-8), each about to expand into N, V, O on reading "laugh".]

Next step: all the probabilities will be multiplied by identical probabilities (lexical and transition). So the children of N2 will have probability less than the children of N1.

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B) = P(B|A), if P(A) = P(B).

P(A|B) can mean:
Causality: B causes A
Sequentiality: A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here: S = {U1, U2, U3}; V = {R, G, B}

For observation O = o1 … on and state sequence Q = q1 … qn:

π_i = P(q1 = U_i)

A =      U1   U2   U3
    U1   0.1  0.4  0.5
    U2   0.6  0.2  0.2
    U3   0.3  0.4  0.3

B =      R    G    B
    U1   0.3  0.5  0.2
    U2   0.1  0.4  0.5
    U3   0.6  0.1  0.3

Observations and states

       O1 O2 O3 O4 O5 O6 O7 O8
Obs:   R  R  G  G  B  R  G  R
State: S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Ambiguity of Multiwords The grandfather kicked the bucket after suffering from cancer This job is a piece of cake Put the sweater on He is the dark horse of the match

Google Translations of above sentencesदादा कसर स पी ड़त होन क बाद बा ट लात मार इस काम क कक का एक टकड़ा हवटर पर रखोवह मच क अधर घोड़ा ह

21 July 2014Pushpak Bhattacharyya Intro

POS 22

Ambiguity of Named Entities

Bengali চ ল সরকার বািড়েত আেছEnglish Government is restless at home ()Chanchal Sarkar is at home

Amsterdam airport ldquoBaby Changing Roomrdquo

Hindi द नक दबग द नयाEnglish everyday bold worldActually name of a Hindi newspaper in Indore

High degree of overlap between NEs and MWEs

Treat differently - transliterate do not translate

21 July 2014Pushpak Bhattacharyya Intro

POS 23

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 24

StructureS

NPVP

V NP

Ilike mangoes

21 July 2014Pushpak Bhattacharyya Intro

POS25

Structural Ambiguity Scope

1The old men and women were taken to safe locations(old men and women) vs ((old men) and women)2 No smoking areas will allow Hookas inside

Preposition Phrase Attachment I saw the boy with a telescope

(who has the telescope) I saw the mountain with a telescope

(world knowledge mountain cannot be an instrument of seeing)

Very ubiquitous newspaper headline ldquo20 years later BMC pays father 20 lakhs for causing sonrsquos deathrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 26

Garden pathing

The only minus possibly was the need to face the audience more and more insightful question answer

The old man the boat

The horse raced past the garden fell

21 July 2014Pushpak Bhattacharyya Intro

POS 27

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 28

Semantic Analysis Representation in terms of

Predicate calculusSemantic NetsFramesConceptual Dependencies and Scripts

John gave a book to Mary Give action Agent John Object Book Recipient

Mary Challenge ambiguity in semantic role labeling

(Eng) Visiting aunts can be a nuisance (Hin) aapko mujhe mithaai khilaanii padegii

(ambiguous in Marathi and Bengali too not in Dravidian languages)

21 July 2014Pushpak Bhattacharyya Intro

POS 29

Coreference challenge

Binding of co-referring nouns and pronouns The monkey ate the banana because it

was hungry The monkey ate the banana because it

was ripe and sweet The monkey ate the banana because it

was lunch time

21 July 2014Pushpak Bhattacharyya Intro

POS 30

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 31

Pragmatics Very hard problem Model user intention

Tourist (in a hurry checking out of the hotel motioning to the service boy) Boy go upstairs and see if my sandals are under the divan Do not be late I just have 15 minutes to catch the train

Boy (running upstairs and coming back panting) yes sir they are there

World knowledge WHY INDIA NEEDS A SECOND OCTOBER (ToI

21007)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

DiscourseProcessing of sequence of sentences Mother to John

John go to school It is open today Should you bunk Father will be very angry

Ambiguity of openbunk whatWhy will the father be angry

Complex chain of reasoning and application of world knowledge Ambiguity of father

father as parent or

father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected ndash today was the math test

He couldnrsquot control the class

Teacher shouldnrsquot have made him responsible

After all he is just a janitor

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (12)1 Teacher (angrily) did you miss the class yesterday

Student not much

2 A man coming back to his parked car sees the sticker Parking fine He goes and thanks the policeman for appreciating his parking skill

3 John I got a Jaguar car for my unemployed youngest sonJack Thats a great exchange

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (22)A teacher-student exchangeTeacher What do you think is the capital of Ethiopia

Student What do you think

Teacher (angrily) I do not think ltpausegt I know

Student I do not think I know ltno pausegt21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps Segmentation (Word Boundary Detection) Identify the word

Isolated Word Recognition Identify W given SS (speech signal)

^arg max ( | )

WW P W SS

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood called ldquophonological model ldquo intuitively more tractable

P(W) = prior probability called ldquolanguage modelrdquo

^arg m ax ( | )

arg m ax ( ) ( | )W

W

W P W SS

P W P SS W

W appears in the corpus( ) words in the corpus

P W

21 July 2014Pushpak Bhattacharyya Intro

POS 39

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound Text ambiguity whether vs weather right vs write bought vs bot

Text Sound ambiguity read (present tense) vs read (past tense) lead (verb) vs lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W)= P(t o m ae t o |Word is ldquotomatordquo) = Product of arc probabilities

t

s4

o m o

ae

t

aa

end

s1 s2 s3s5

s6 s7

10 10 10 1010

10

073

027

Word

Pronunciation Automaton

Tomato

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs Discrimnative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

bull Entity A and Entity BndashMatch(AB)

ndashTwo entities match iff their parts matchbull Match(Parts(A) Parts(B))

ndashTwo entities match iff their properties matchbull Match(Properties(A) Properties(B))

bull Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding James Allan Speech and NLP Jurafsky and Martin Foundations of Statistical NLP Manning and Schutze

Other References Statistical NLP Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied DisciplinesPhilosophy Semantics Meaning of ldquomeaningrdquo Logic

(syllogism)Linguistics Study of Syntax Lexicon Lexical Semantics etc

Probability and Statistics Corpus Linguistics Testing of Hypotheses System Evaluation

Cognitive Science Computational Models of Language Processing Language Acquisition

Psychology Behavioristic insights into Language Processing Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory Entropy Random Fields

Computer Sc amp Engg Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

infection declension derivation etc ) Rule based Vs Statistical NLU comparison with POS tagging as case study Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

ldquo_ldquo The_DT mechanisms_NNS that_WDTmake_VBP traditional_JJ hardware_NNare_VBP really_RB being_VBGobsoleted_VBN by_IN microprocessor-based_JJ machines_NNS _ rdquo_rdquo said_VBDMr_NNP Benton_NNP _

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R ._.

[Figure: lattice of tags over the sentence, with Lexical Probabilities on the word-to-tag links and Bigram Probabilities on the tag-to-tag links]

This model is called the Generative model. Here words are observed (generated) from the tags, which act as states. This is similar to an HMM.

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

[Chart: per-POS accuracy under the bigram assumption; one bar per tag (AJ0, AJ0-NN1, AJ0-VVG, AJC, AT0, AV0-AJ0, AVP-PRP, AVQ-CJS, CJS, CJS-PRP, CJT-DT0, CRD-PNI, DT0, DTQ, ITJ, NN1, NN1-NP0, NN1-VVG, NN2-VVZ, NP0-NN1, PNI, PNP, PNX, PRP, PRP-CJS, TO0, VBB, VBG, VBN, VDB, VDG, VDN, VHB, VHG, VHN, VM0, VVB-NN1, VVD-AJ0, VVG, VVG-NN1, VVN, VVN-VVD, VVZ-NN2), y-axis from 0 to 1.2]

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

            AJ0  AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN  AJC  AJS   AT0   AV0  AV0-AJ0  AVP
AJ0        2899       20       32        1        3        3    0    0    18    35       27    1
AJ0-AV0      31       18        2        0        0        0    0    0     0     1       15    0
AJ0-NN1     161        0      116        0        0        0    0    0     0     0        1    0
AJ0-VVD       7        0        0        0        0        0    0    0     0     0        0    0
AJ0-VVG       8        0        0        0        2        0    0    0     1     0        0    0
AJ0-VVN       8        0        0        3        0        2    0    0     1     0        0    0
AJC           2        0        0        0        0        0   69    0     0    11        0    0
AJS           6        0        0        0        0        0    0   38     0     2        0    0
AT0         192        0        0        0        0        0    0    0  7000    13        0    0
AV0         120        8        2        0        0        0   15    2    24  2444       29   11
AV0-AJ0      10        7        0        0        0        0    0    0     0    16       33    0
AVP          24        0        0        0        0        0    0    0     1    11        0  737
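The two evaluation artefacts above (per-POS accuracy and the confusion matrix) come from simple counting. Below is a minimal sketch (not from the slides), assuming the gold and predicted tags are given as two equal-length lists:

```python
from collections import Counter, defaultdict

def evaluate(gold_tags, predicted_tags):
    # confusion[g][p] = number of tokens whose gold tag is g and predicted tag is p
    confusion = defaultdict(Counter)
    for g, p in zip(gold_tags, predicted_tags):
        confusion[g][p] += 1
    # per-POS accuracy: fraction of tokens of each gold tag that were tagged correctly
    per_pos_accuracy = {g: row[g] / sum(row.values()) for g, row in confusion.items()}
    return confusion, per_pos_accuracy
```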

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

[Figure: the NLP Trinity; Problem axis: Morph Analysis, Part of Speech Tagging, Parsing, Semantics; Language axis: Hindi, Marathi, English, French; Algorithm axis: HMM, CRF, MEMM]

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1: # of Red = 30, # of Green = 50, # of Blue = 20
Urn 2: # of Red = 10, # of Green = 40, # of Blue = 50
Urn 3: # of Red = 60, # of Green = 10, # of Blue = 30

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

Given:

Transition probability table
        U1    U2    U3
U1     0.1   0.4   0.5
U2     0.6   0.2   0.2
U3     0.3   0.4   0.3

and

Emission probability table
        R     G     B
U1     0.3   0.5   0.2
U2     0.1   0.4   0.5
U3     0.6   0.1   0.3

Observation: RRGGBRGR
State sequence: ? (not so easily computable)

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability table and transition probability table (shown above)

Diagrammatic representation (12)

[Figure: the three urns U1, U2, U3 drawn as states, with the transition probabilities 0.1, 0.2, 0.4, 0.6, 0.4, 0.5, 0.3, 0.2, 0.3 on the arcs and the emission probabilities inside each state: U1 (R 0.3, G 0.5, B 0.2), U2 (R 0.1, G 0.4, B 0.5), U3 (R 0.6, G 0.1, B 0.3)]

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

[Figure: the same three-state machine redrawn with the emission and transition probabilities multiplied onto each arc, e.g. R 0.02, G 0.08, B 0.10; R 0.24, G 0.04, B 0.12; R 0.06, G 0.24, B 0.30; R 0.08, G 0.20, B 0.12; R 0.15, G 0.25, B 0.10; R 0.18, G 0.03, B 0.09; R 0.03, G 0.05, B 0.02]

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM:
1. Given the observation sequence, find the possible state sequence(s): Viterbi
2. Given the observation sequence, find its probability: forward/backward algorithm
3. Given the observation sequence, find the HMM parameters: Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi

The "start" and "end" are important in a sequence. Subtrees get eliminated due to the Markov Assumption.

POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), . (end) [start & end states]

Lexicon:
  people: N, V
  laugh: N, V

Corpora for Training:
  ^ w11_t11 w12_t12 w13_t13 ………… w1k_1_t1k_1
  ^ w21_t21 w22_t22 w23_t23 ………… w2k_2_t2k_2
  ^ wn1_tn1 wn2_tn2 wn3_tn3 ………… wnk_n_tnk_n

Inference

Partial sequence graph: [^ expanding to N and V nodes]

Transition probability table (this table will change from language to language due to language divergences):
        ^     N     V     O     .
^       0    0.6   0.2   0.2    0
N       0    0.1   0.4   0.3   0.2
V       0    0.3   0.1   0.3   0.3
O       0    0.3   0.2   0.3   0.2
.       1     0     0     0     0

Lexical Probability Table (size of this table = #POS tags in tagset × vocabulary size, where vocabulary size = #unique words in the corpus):
        ε        people      laugh      …
^       1          0           0        0
N       0       1×10⁻³      1×10⁻⁵
V       0       1×10⁻⁶      1×10⁻³
O       0          0        1×10⁻⁹
.       1          0           0        0
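For reference, both tables are just relative-frequency counts over the training corpora. A minimal sketch (not from the slides), assuming each training sentence is a list of (word, tag) pairs that already includes the boundary markers:

```python
from collections import Counter

def train(tagged_sentences):
    tag_count, bigram_count, word_tag_count = Counter(), Counter(), Counter()
    for sent in tagged_sentences:                    # sent = [(word, tag), ...]
        for i, (word, tag) in enumerate(sent):
            tag_count[tag] += 1
            word_tag_count[(word, tag)] += 1
            if i > 0:
                bigram_count[(sent[i - 1][1], tag)] += 1
    trans = {(t1, t2): c / tag_count[t1] for (t1, t2), c in bigram_count.items()}
    lex = {(w, t): c / tag_count[t] for (w, t), c in word_tag_count.items()}
    return trans, lex
```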

Inference: New Sentence

^ people laugh

p( ^ N N | ^ people laugh ) = (0.6 × 0.1) × (0.1 × 1×10⁻³) × (0.2 × 1×10⁻⁵)

[Partial trellis for the new sentence: ^, followed on ε by {N, V} columns for 'people' and 'laugh']

Computational Complexity

If we have to get the probability of each sequence and then find the maximum among them, we would run into an exponential number of computations.

If |s| = #states (tags + ^ + .) and |o| = length of the sentence (#words + ^ + .), then #sequences = |s|^(|o|−2)

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming

[Trellis: ^ moves on ε to {N1, V1, O1} reading 'people'; each of these expands to further N, V, O nodes (N2 … N5, …) reading 'laugh']

Score(^ N1) = 0.6 × 1.0 = 0.6
Extending N1 (with P(people|N) = 10⁻³):
  to N: 0.6 × 0.1 × 10⁻³ = 6 × 10⁻⁵
  to V: 0.6 × 0.4 × 10⁻³ = 2.4 × 10⁻⁴
  to O: 0.6 × 0.3 × 10⁻³ = 1.8 × 10⁻⁴
  to .: 0.6 × 0.2 × 10⁻³ = 1.2 × 10⁻⁴

No need to expand N4 and N5 because they will never be a part of the winning sequence.

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.

Now complexity reduces from exponential (|s|^|o|) to |s|·|o|.

Here we followed the Markov assumption of order 1.

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state. Keep advancing sequences that are "maximum" amongst all those ending in the same state.

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm

Tree for the sentence "^ People laugh"

[Tree: from ^, reading ε, branches N (0.6), V (0.2), O (0.2); subsequent branches reading 'People' and 'laugh' carry scores 0.06×10⁻³, 0.24×10⁻³, 0.18×10⁻³, then 0.06×10⁻⁶, 0.02×10⁻⁶, 0.06×10⁻⁶, and (0), (0), (0)]

Claim: We do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass

Will a word always be given the same tag? No. Consider the example:

^ people the city with soldiers (i.e. 'populate')
^ quickly people the city

In the first sentence "people" is most likely to be tagged as noun, whereas in the second the probability mass will shift and "people" will be tagged as verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon

Long tail Phenomenon: the probability is very low but not zero over a large observed sequence.

Language Phenomenon: "people", which is predominantly tagged as "Noun", displays a long tail phenomenon; "laugh" is predominantly tagged as "Verb".

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(6×10⁻⁵) (6×10⁻⁸)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B) = P(B|A) if P(A) = P(B)

P(A|B) can mean:
  Causality: B causes A
  Sequentiality: A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here, S = {U1, U2, U3}, V = {R, G, B}

For observation sequence O = o1 … on and state sequence Q = q1 … qn:

π (initial state probabilities): πi = P(q1 = Ui)

A (state transition probability table) =
        U1    U2    U3
U1     0.1   0.4   0.5
U2     0.6   0.2   0.2
U3     0.3   0.4   0.3

B (emission probability table) =
        R     G     B
U1     0.3   0.5   0.2
U2     0.1   0.4   0.5
U3     0.6   0.1   0.3

92

Observations and states

        O1  O2  O3  O4  O5  O6  O7  O8
OBS:    R   R   G   G   B   R   G   R
State:  S1  S2  S3  S4  S5  S6  S7  S8

Si ∈ {U1, U2, U3}: a particular state
S: state sequence; O: observation sequence
S* = "best" possible state (urn) sequence
Goal: maximize P(S|O) by choosing the "best" S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

S* = argmax_S P(S|O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

P(S1–8 | O1–8) = P(S1|O) P(S2|S1, O) P(S3|S2, S1, O) … P(S8|S7 … S1, O)

By Markov Assumption (a state depends only on the previous state):

P(S1–8 | O1–8) = P(S1|O) P(S2|S1, O) P(S3|S2, O) … P(S8|S7, O)

        O1  O2  O3  O4  O5  O6  O7  O8
OBS:    R   R   G   G   B   R   G   R
State:  S1  S2  S3  S4  S5  S6  S7  S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) P(B) = P(A) P(B|A)

P(A): Prior
P(B|A): Likelihood

argmax_S P(S|O) = argmax_S P(S) P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

P(S) = P(S1–8) = P(S1) P(S2|S1) P(S3|S2, S1) P(S4|S3, S2, S1) … P(S8|S7 … S1)

By Markov Assumption (k = 1):

P(S) = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) … P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

P(O|S) = P(O1|S1–8) P(O2|O1, S1–8) P(O3|O1–2, S1–8) … P(O8|O1–7, S1–8)

Assumption: the ball drawn depends only on the Urn chosen.

P(O|S) = P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)

P(S|O) ∝ P(S) P(O|S)
       = [P(S1) P(S2|S1) P(S3|S2) P(S4|S3) … P(S8|S7)] × [P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)]

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

[Figure: state chain S0 → S1 → S2 → … → S8 → S9, with the observations ε, R, R, G, G, B, R, G, R emitted on the successive arcs]

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:    ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

P(Ok|Sk) P(Sk+1|Sk) = P(Sk → Sk+1)^Ok   (the probability of the arc from Sk to Sk+1 that emits Ok)

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

[Figure: two states S1 and S2 with arcs labelled (a1:0.3), (a2:0.4), (a1:0.2), (a2:0.3), (a1:0.1), (a2:0.2), (a1:0.3), (a2:0.2)]

The question here is: "what is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree

Start
[Tree: from Start, an ε-move reaches S1 with probability 1.0 and S2 with 0.0; on a1 each state expands to S1 and S2 with arc probabilities 0.1, 0.3, 0.2, 0.3; on a2 the arc probabilities are 0.2, 0.4, 0.3, 0.2]

After a1: 1.0 × 0.1 = 0.1, 0.3, 0.0, 0.0
After a2: 0.1 × 0.2 = 0.02, 0.1 × 0.4 = 0.04, 0.3 × 0.3 = 0.09, 0.3 × 0.2 = 0.06

Choose the winning sequence per state per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd…

[Tree continued: from the surviving scores 0.09 (S1) and 0.06 (S2); on a1: 0.09 × 0.1 = 0.009, 0.027, 0.012, 0.018; on a2: via 0.3 → 0.0081, via 0.2 → 0.0054, via 0.4 → 0.0048, via 0.2 → 0.0024]

The problem being addressed by this tree is S* = argmax_S P(S | a1-a2-a1-a2, μ),
where a1-a2-a1-a2 is the output sequence and μ the model (or the machine).

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 – S2 – S1 – S2 – S1   (outputs a2, a1, a1, a2 on the arcs)

Problem statement: Find the best possible sequence S* = argmax_S P(S | O, μ)
where O = output sequence, S = state sequence, μ = machine or model.

Machine or Model = (S, S0, A, T): S = state collection, S0 = start symbol, A = alphabet set, T = transitions.

T is defined as the set of arc probabilities P(Si –ak→ Sj).

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

            ε        a1                              a2              a1                 a2
S1         1.0   (1.0×0.1, 0.0×0.2) = (0.1, 0.0)   (0.02, 0.09)   (0.009, 0.012)   (0.0024, 0.0081)
S2         0.0   (1.0×0.3, 0.0×0.3) = (0.3, 0.0)   (0.04, 0.06)   (0.027, 0.018)   (0.0048, 0.0054)

Rows: ending state; columns: latest symbol observed.

Note: every cell records the winning probability ending in that state. The bold-faced value in each cell shows the winning sequence probability ending in that state; going backward from the final winner, which ends in state S2 (indicated by the 2nd tuple), we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm (following James Allen, Natural Language Understanding (2nd edition), Benjamin/Cummings (pub.), 1995)

Given:
1. The HMM, which means:
   a. Start State: S1
   b. Alphabet: A = {a1, a2, …, ap}
   c. Set of States: S = {S1, S2, …, Sn}
   d. Transition probability P(Si –ak→ Sj), which is equal to P(Sj | Si, ak)
2. The output string a1 a2 … aT

To find: the most likely sequence of states C1 C2 … CT which produces the given output sequence, i.e.,
C1 C2 … CT = argmax_C [ P(C | a1 a2 … aT) ]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contd…

Data structures:
1. An N×T array called SEQSCORE to maintain the winner sequence always (N = #states, T = length of o/p sequence)
2. Another N×T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For (i = 2 to N) do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For (t = 2 to T) do
    For (i = 1 to N) do
        SEQSCORE(i,t) = Max (j = 1..N) [ SEQSCORE(j, (t−1)) × P(Sj –ak→ Si) ]
        BACKPTR(i,t) = the index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = i that maximizes SEQSCORE(i,T)
For i from (T−1) to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1×T
2. SEQSCORE can be T×2

Homework: compare this with A* and Beam Search. Reason for this comparison: both of them work for finding and recovering a sequence.
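As a companion to the three steps above, here is a minimal Python sketch of the same procedure (not from the slides). The SEQSCORE/BACKPTR arrays follow the description above; arc_prob is an assumed helper returning P(Sj –ak→ Si), the combined transition-and-output probability of an arc.

```python
def viterbi(output_string, states, arc_prob):
    # states[0] is assumed to be the start state S1;
    # output_string[0] is treated as its epsilon output (the trellis convention used earlier).
    N, T = len(states), len(output_string)
    SEQSCORE = [[0.0] * T for _ in range(N)]
    BACKPTR = [[0] * T for _ in range(N)]
    SEQSCORE[0][0] = 1.0                                   # Initialization
    for t in range(1, T):                                  # Iteration
        a = output_string[t]
        for i in range(N):
            scores = [SEQSCORE[j][t - 1] * arc_prob(states[j], a, states[i]) for j in range(N)]
            best_j = max(range(N), key=lambda j: scores[j])
            SEQSCORE[i][t], BACKPTR[i][t] = scores[best_j], best_j
    C = [0] * T                                            # Sequence Identification
    C[T - 1] = max(range(N), key=lambda i: SEQSCORE[i][T - 1])
    for t in range(T - 2, -1, -1):
        C[t] = BACKPTR[C[t + 1]][t + 1]
    return [states[i] for i in C]
```

The sketch keeps the full N×T BACKPTR array for clarity; as the optimizations above note, it can be shrunk.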

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

[Figure: Viterbi tree for the urn problem over the first two symbols ε and R; S0 branches to U1, U2, U3 (probabilities 0.5, 0.3, 0.2), then on reading R each urn branches again to U1, U2, U3, with the partial-sequence scores (0.03, 0.08, 0.15, …, 0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036) at the nodes; only the best score ending in each state is retained]

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory works:
P(S) P(O|S) = P(O0|S0) P(S1|S0)
  [P(O1|S1) P(S2|S1,S0)] [P(O2|S2) P(S3|S2,S1)] [P(O3|S3) P(S4|S3,S2)] [P(O4|S4) P(S5|S4,S3)] [P(O5|S5) P(S6|S5,S4)] [P(O6|S6) P(S7|S6,S5)] [P(O7|S7) P(S8|S7,S6)] [P(O8|S8) P(S9|S8,S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence? The language modeling problem:

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   • Less probable because of grammatical mistake
3. P("The svn rises in the east")
   • Less probable because of lexical mistake
4. P("The sun rises in the west")
   • Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model:
1. Detect well-formedness
   • Lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   • Given a piece of text, what language does it belong to?
     Good morning → English
     Guten morgen → German
     Bon jour → French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0 o1 o2 o3 … om)?

P(O) = Σ_S P(O, S)    (Marginalization)

Consider the observation sequence O = o0 o1 o2 … om with state sequences S = S0 S1 S2 S3 … Sm Sm+1, where the Si's represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0 o1 o2 o3 … om)

P(O, S) = P(S) P(O|S)
        = P(S0 S1 S2 … Sm+1) P(O0 O1 O2 … Om | S0 S1 … Sm+1)
        = [P(S0) P(S1|S0) P(S2|S1) … P(Sm+1|Sm)] [P(O0|S0) P(O1|S1) … P(Om|Sm)]
        = [P(O0|S0) P(S1|S0)] [P(O1|S1) P(S2|S1)] … [P(Om|Sm) P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(k,i) = probability of being in state Si having seen o0 o1 o2 … ok

F(k,i) = P(o0 o1 o2 … ok, Si)

With m as the length of the observed sequence and N states,

P(observed sequence) = P(o0 o1 o2 … om) = Σ_{p=0..N} P(o0 o1 o2 … om, Sp) = Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)

F(k, q) = P(o0 o1 o2 … ok, Sq)
        = P(o0 o1 o2 … ok−1, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 … ok−1, Sp, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 … ok−1, Sp) P(ok, Sq | o0 o1 o2 … ok−1, Sp)
        = Σ_{p=0..N} F(k−1, p) P(ok, Sq | Sp)
        = Σ_{p=0..N} F(k−1, p) P(Sp → Sq)^{ok}

        O0  O1  O2  O3  …  Ok  Ok+1  …  Om−1  Om
        S0  S1  S2  S3  …  Sp  Sq    …  Sm    Sfinal
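The recurrence translates directly into code. The sketch below is illustrative (not from the slides) and assumes a helper arc_prob(p, q, o) returning P(Sp → Sq)^o, the probability of the arc from state p to state q that emits o:

```python
def forward(observations, n_states, arc_prob, start=0):
    # F[q] holds F(k, q) = P(o_0 ... o_k, S_q); o_0 is the epsilon output of the start state
    F = [arc_prob(start, q, observations[0]) for q in range(n_states)]
    for o in observations[1:]:
        F = [sum(F[p] * arc_prob(p, q, o) for p in range(n_states)) for q in range(n_states)]
    return sum(F)        # P(observed sequence) = sum over p of F(m, p)
```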

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(k,i) = probability of seeing ok ok+1 ok+2 … om given that the state was Si

B(k,i) = P(ok ok+1 ok+2 … om | Si)

With m as the length of the whole observed sequence,

P(observed sequence) = P(o0 o1 o2 … om) = P(o0 o1 o2 … om | S0) = B(0, 0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)

B(k, p) = P(ok ok+1 ok+2 … om | Sp)
        = P(ok+1 ok+2 … om, ok | Sp)
        = Σ_{q=0..N} P(ok+1 ok+2 … om, ok, Sq | Sp)
        = Σ_{q=0..N} P(ok, Sq | Sp) P(ok+1 ok+2 … om | ok, Sq, Sp)
        = Σ_{q=0..N} P(ok+1 ok+2 … om | Sq) P(ok, Sq | Sp)
        = Σ_{q=0..N} B(k+1, q) P(Sp → Sq)^{ok}

        O0  O1  O2  O3  …  Ok  Ok+1  …  Om−1  Om
        S0  S1  S2  S3  …  Sp  Sq    …  Sm    Sfinal
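The backward recurrence can be coded the same way; this is just the recurrence above transcribed as a sketch, under the same assumed arc_prob(p, q, o) = P(Sp → Sq)^o convention as the forward sketch (the exercise later asks for the full derivation and complexity):

```python
def backward(observations, n_states, arc_prob, start=0):
    B = [1.0] * n_states                      # B(m+1, q) = 1 for every q, by convention
    for o in reversed(observations):          # fold in o_m, o_{m-1}, ..., o_0
        B = [sum(arc_prob(p, q, o) * B[q] for q in range(n_states)) for p in range(n_states)]
    return B[start]                           # P(observed sequence) = B(0, S0)
```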

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probability table:
        ^     N     V     .
^       0    0.7   0.3    0
N       0    0.2   0.6   0.2
V       0    0.6   0.2   0.2
.       1     0     0     0

Lexical probability table:
        ε    People   Laugh
^       1      0        0
N       0     0.8      0.2
V       0     0.1      0.9
.       1      0        0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all state sequences of ∏ P(Si → Sj)^{o}   (the sum of the probabilities of all paths in the tree below)

Computation in various paths of the Tree

Path 1 (ε: ^, People: N, Laugh: N):  P(Path1) = (1.0 × 0.7) × (0.8 × 0.2) × (0.2 × 0.2)
Path 2 (ε: ^, People: N, Laugh: V):  P(Path2) = (1.0 × 0.7) × (0.8 × 0.6) × (0.9 × 0.2)
Path 3 (ε: ^, People: V, Laugh: N):  P(Path3) = (1.0 × 0.3) × (0.1 × 0.6) × (0.2 × 0.2)
Path 4 (ε: ^, People: V, Laugh: V):  P(Path4) = (1.0 × 0.3) × (0.1 × 0.2) × (0.9 × 0.2)

[Figure: the four-path tree ^ → {N, V} → {N, V} over ε, People, Laugh]

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis

F = accumulated F × output probability × transition probability

F1 = 0.7 × 1.0
F2 = 0.3 × 1.0
F3 = F1 × (0.2 × 0.8) + F2 × (0.6 × 0.1)
F4 = F1 × (0.6 × 0.8) + F2 × (0.2 × 0.1)
F5 = F3 × (0.2 × 0.2) + F4 × (0.2 × 0.9)

[Trellis: ^ on ε to {N: F1, V: F2}, on People to {N: F3, V: F4}, on Laugh to the end state: F5]
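Working these recurrences through with the tables above: F1 = 0.7, F2 = 0.3, F3 = 0.7 × 0.16 + 0.3 × 0.06 = 0.13, F4 = 0.7 × 0.48 + 0.3 × 0.02 = 0.342, and F5 = 0.13 × 0.04 + 0.342 × 0.18 = 0.06676, which equals the sum of the four path probabilities on the previous slide (0.00448 + 0.06048 + 0.00072 + 0.00108 = 0.06676), as expected.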

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of Multiplications

Tree: each path has 5 multiplications + 1 addition. There are 4 paths in the tree. Therefore a total of 20 multiplications and 3 additions.

Trellis: F1 → 1 multiplication; F2 → 1 multiplication; F3 = F1 × (1 mult) + F2 × (1 mult) = 4 multiplications + 1 addition. Similarly for F4 and F5: 4 multiplications and 1 addition each. So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = #states and |O| = observation length (excluding the '^' and '.' markers).

Stage 1 of the Trellis: |S| multiplications.
Stage 2 of the Trellis: |S| nodes, each node needs computation over |S| arcs; each arc = 1 multiplication, accumulated F = 1 more multiplication. Total: 2|S|² multiplications.
Same for each stage before reading '.'. At the final stage ('.'): 2|S| multiplications.
Therefore total multiplications = |S| + 2|S|²(|O| − 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of F values multiplied by P(Sp → Sq).
3. Complexity = |S| + 2|S|²(|O| − 1) + 2|S| = 2|S|²|O| − 2|S|² + 3|S| = O(|S|²·|O|),
   i.e. linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward Probability:
   a) Derive the Backward Algorithm
   b) Compute its complexity
2. Express P(O) in terms of both Forward and Backward probability.

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List

TnT (http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf)

Brill Tagger (http://delivery.acm.org/10.1145/1080000/1075553/p112-brill.pdf)

Hindi POS Tagger built by IIT Bombay (http://www.cse.iitb.ac.in/~pb/papers/ACL-2006-Hindi-POS-Tagging.pdf)

Projection (http://www.dipanjandas.com/files/posInduction.pdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Ambiguity of Named Entities

Bengali: চঞ্চল সরকার বাড়িতে আছে
English: "Government is restless at home" (?) – actually "Chanchal Sarkar is at home"

Amsterdam airport: "Baby Changing Room"

Hindi: दैनिक दबंग दुनिया
English: "everyday bold world" – actually the name of a Hindi newspaper in Indore

High degree of overlap between NEs and MWEs.

Treat differently: transliterate, do not translate.

21 July 2014Pushpak Bhattacharyya Intro

POS 23

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increased Complexity Of Processing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 24

StructureS

NPVP

V NP

I   like   mangoes

21 July 2014Pushpak Bhattacharyya Intro

POS25

Structural Ambiguity: Scope

1. The old men and women were taken to safe locations.
   (old (men and women)) vs ((old men) and women)
2. No smoking areas will allow Hookahs inside.

Preposition Phrase Attachment:
  I saw the boy with a telescope (who has the telescope?)
  I saw the mountain with a telescope (world knowledge: a mountain cannot be an instrument of seeing)

A very ubiquitous newspaper-headline example: "20 years later, BMC pays father 20 lakhs for causing son's death"

21 July 2014Pushpak Bhattacharyya Intro

POS 26

Garden pathing

The only minus possibly was the need to face the audience more and more insightful question answer

The old man the boat

The horse raced past the garden fell

21 July 2014Pushpak Bhattacharyya Intro

POS 27

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increased Complexity Of Processing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 28

Semantic Analysis Representation in terms of

Predicate calculusSemantic NetsFramesConceptual Dependencies and Scripts

John gave a book to Mary. → Give: action; Agent: John; Object: Book; Recipient: Mary

Challenge: ambiguity in semantic role labeling
  (Eng) Visiting aunts can be a nuisance.
  (Hin) aapko mujhe mithaai khilaanii padegii ("you will have to treat me to sweets" vs "I will have to treat you to sweets"; ambiguous in Marathi and Bengali too, not in Dravidian languages)

21 July 2014Pushpak Bhattacharyya Intro

POS 29

Coreference challenge

Binding of co-referring nouns and pronouns The monkey ate the banana because it

was hungry The monkey ate the banana because it

was ripe and sweet The monkey ate the banana because it

was lunch time

21 July 2014Pushpak Bhattacharyya Intro

POS 30

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increased Complexity Of Processing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 31

Pragmatics Very hard problem Model user intention

Tourist (in a hurry checking out of the hotel motioning to the service boy) Boy go upstairs and see if my sandals are under the divan Do not be late I just have 15 minutes to catch the train

Boy (running upstairs and coming back panting) yes sir they are there

World knowledge WHY INDIA NEEDS A SECOND OCTOBER (ToI

21007)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increased Complexity Of Processing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

Discourse: processing of a sequence of sentences. Mother to John:

"John, go to school. It is open today. Should you bunk? Father will be very angry."

Ambiguity: open/bunk what? Why will the father be angry?

Complex chain of reasoning and application of world knowledge Ambiguity of father

father as parent or

father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected ndash today was the math test

He couldnrsquot control the class

Teacher shouldnrsquot have made him responsible

After all he is just a janitor

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (12)1 Teacher (angrily) did you miss the class yesterday

Student not much

2 A man coming back to his parked car sees the sticker Parking fine He goes and thanks the policeman for appreciating his parking skill

3 John I got a Jaguar car for my unemployed youngest sonJack Thats a great exchange

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (22)A teacher-student exchangeTeacher What do you think is the capital of Ethiopia

Student What do you think

Teacher (angrily) I do not think ltpausegt I know

Student I do not think I know ltno pausegt21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps Segmentation (Word Boundary Detection) Identify the word

Isolated Word Recognition Identify W given SS (speech signal)

W* = argmax_W P(W | SS)

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood, called the "phonological model"; intuitively more tractable
P(W) = prior probability, called the "language model"

W* = argmax_W P(W | SS) = argmax_W P(W) P(SS | W)

P(W) = #(W appears in the corpus) / #(words in the corpus)

21 July 2014Pushpak Bhattacharyya Intro

POS 39

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound Text ambiguity whether vs weather right vs write bought vs bot

Text Sound ambiguity read (present tense) vs read (past tense) lead (verb) vs lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W)= P(t o m ae t o |Word is ldquotomatordquo) = Product of arc probabilities

t

s4

o m o

ae

t

aa

end

s1 s2 s3s5

s6 s7

10 10 10 1010

10

073

027

Word

Pronunciation Automaton

Tomato

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs Discriminative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

• Entity A and Entity B: Match(A, B)
  – Two entities match iff their parts match: Match(Parts(A), Parts(B))
  – Two entities match iff their properties match: Match(Properties(A), Properties(B))
• Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books, Journals, Proceedings

Main Text(s):
  Natural Language Understanding, James Allen
  Speech and NLP, Jurafsky and Martin
  Foundations of Statistical NLP, Manning and Schutze

Other References:
  Statistical NLP, Charniak

Journals: Computational Linguistics, Natural Language Engineering, AI, AI Magazine, IEEE SMC

Conferences: ACL, EACL, COLING, MT Summit, EMNLP, IJCNLP, HLT, ICON, SIGIR, WWW, ICML, ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied Disciplines

Philosophy: Semantics, meaning of "meaning", Logic (syllogism)
Linguistics: Study of Syntax, Lexicon, Lexical Semantics etc.
Probability and Statistics: Corpus Linguistics, Testing of Hypotheses, System Evaluation
Cognitive Science: Computational Models of Language Processing, Language Acquisition
Psychology: Behavioristic insights into Language Processing, Psychological Models
Brain Science: Language Processing Areas in Brain
Physics: Information Theory, Entropy, Random Fields
Computer Sc & Engg: Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2: Shallow parsing
  Before break: Morph analysis and synthesis (segmentation, inflection, declension, derivation, etc.), Rule based vs Statistical NLU comparison with POS tagging as case study, Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summary

• Both Linguistics and Computation are needed
• Linguistics is the eye, Computation the body
• The Phenomenon-Formalization-Technique-Experimentation-Evaluation-Hypothesis Testing loop has accorded to NLP the prestige it commands today
• Natural Science like approach
• Neither Theory Building nor Data Driven Pattern finding can be ignored

21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

[Figure: the NLP Trinity; Problem axis: Morph Analysis, Part of Speech Tagging, Parsing, Semantics; Language axis: Hindi, Marathi, English, French; Algorithm axis: HMM, CRF, MEMM]

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

ldquo_ldquo The_DT mechanisms_NNS that_WDTmake_VBP traditional_JJ hardware_NNare_VBP really_RB being_VBGobsoleted_VBN by_IN microprocessor-based_JJ machines_NNS _ rdquo_rdquo said_VBDMr_NNP Benton_NNP _

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 24

StructureS

NPVP

V NP

Ilike mangoes

21 July 2014Pushpak Bhattacharyya Intro

POS25

Structural Ambiguity Scope

1The old men and women were taken to safe locations(old men and women) vs ((old men) and women)2 No smoking areas will allow Hookas inside

Preposition Phrase Attachment I saw the boy with a telescope

(who has the telescope) I saw the mountain with a telescope

(world knowledge mountain cannot be an instrument of seeing)

Very ubiquitous newspaper headline ldquo20 years later BMC pays father 20 lakhs for causing sonrsquos deathrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 26

Garden pathing

The only minus possibly was the need to face the audience more and more insightful question answer

The old man the boat

The horse raced past the garden fell

21 July 2014Pushpak Bhattacharyya Intro

POS 27

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 28

Semantic Analysis Representation in terms of

Predicate calculusSemantic NetsFramesConceptual Dependencies and Scripts

John gave a book to Mary Give action Agent John Object Book Recipient

Mary Challenge ambiguity in semantic role labeling

(Eng) Visiting aunts can be a nuisance (Hin) aapko mujhe mithaai khilaanii padegii

(ambiguous in Marathi and Bengali too not in Dravidian languages)

21 July 2014Pushpak Bhattacharyya Intro

POS 29

Coreference challenge

Binding of co-referring nouns and pronouns The monkey ate the banana because it

was hungry The monkey ate the banana because it

was ripe and sweet The monkey ate the banana because it

was lunch time

21 July 2014Pushpak Bhattacharyya Intro

POS 30

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 31

Pragmatics Very hard problem Model user intention

Tourist (in a hurry checking out of the hotel motioning to the service boy) Boy go upstairs and see if my sandals are under the divan Do not be late I just have 15 minutes to catch the train

Boy (running upstairs and coming back panting) yes sir they are there

World knowledge WHY INDIA NEEDS A SECOND OCTOBER (ToI

21007)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

DiscourseProcessing of sequence of sentences Mother to John

John go to school It is open today Should you bunk Father will be very angry

Ambiguity of openbunk whatWhy will the father be angry

Complex chain of reasoning and application of world knowledge Ambiguity of father

father as parent or

father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected ndash today was the math test

He couldnrsquot control the class

Teacher shouldnrsquot have made him responsible

After all he is just a janitor

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (12)1 Teacher (angrily) did you miss the class yesterday

Student not much

2 A man coming back to his parked car sees the sticker Parking fine He goes and thanks the policeman for appreciating his parking skill

3 John I got a Jaguar car for my unemployed youngest sonJack Thats a great exchange

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (22)A teacher-student exchangeTeacher What do you think is the capital of Ethiopia

Student What do you think

Teacher (angrily) I do not think ltpausegt I know

Student I do not think I know ltno pausegt21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem definition: given a sequence of speech signals, identify the words.

2 steps: Segmentation (word boundary detection); Identify the word.

Isolated Word Recognition: identify W given SS (speech signal):

W* = argmax_W P(W | SS)

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood, called the "phonological model"; intuitively more tractable

P(W) = prior probability, called the "language model"

W* = argmax_W P(W | SS) = argmax_W P(W) · P(SS|W)

P(W) = #(W appears in the corpus) / #(words in the corpus)
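A minimal sketch of this decision rule, assuming a toy corpus for the unigram prior and made-up phonological-model scores (none of these numbers come from the lecture):

# Sketch of the noisy-channel decision rule W* = argmax_W P(W) * P(SS|W).
# The corpus and the likelihood table below are invented toy values.
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigram_counts = Counter(corpus)
total = sum(unigram_counts.values())

def prior(word):
    # P(W) = #(W appears in the corpus) / #(words in the corpus)
    return unigram_counts[word] / total

# P(SS|W): likelihood of the observed speech signal given the word (toy numbers)
likelihood = {("k ae t", "cat"): 0.7, ("k ae t", "mat"): 0.1}

def decode(ss, candidates):
    # pick the candidate word maximizing P(W) * P(SS|W)
    return max(candidates, key=lambda w: prior(w) * likelihood.get((ss, w), 0.0))

print(decode("k ae t", ["cat", "mat"]))   # -> 'cat'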

21 July 2014Pushpak Bhattacharyya Intro

POS 39

Ambiguities in the context of P(SS|W) or P(W|SS). Concerns:

Sound → Text ambiguity: whether vs. weather, right vs. write, bought vs. bot

Text → Sound ambiguity: read (present tense) vs. read (past tense), lead (verb) vs. lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention:

/t/ — sound of 't' in tag; /d/ — sound of 'd' in dog; /D/ — sound of 'th' in 'the'

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb), Advice (noun): ad-vise, ad-vice

• A syllable consists of:
1. Nucleus
2. Onset
3. Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W) = P(t o m ae t o | Word is "tomato") = product of arc probabilities

[Pronunciation automaton for the word "Tomato": states s1 … s7 → end; arcs emit t, o, m, then branch to ae (0.73) or aa (0.27), then t, o; all other arcs have probability 1.0]
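A tiny sketch of the product-of-arc-probabilities computation for the two pronunciations, using only the branch probabilities shown above (every other arc is 1.0):

# P(SS|W) as a product of arc probabilities along one path of the automaton
arc_probs = {"ae": 0.73, "aa": 0.27}   # the only non-1.0 arcs on the slide

def pronunciation_prob(phonemes):
    p = 1.0
    for ph in phonemes:
        p *= arc_probs.get(ph, 1.0)    # all remaining arcs have probability 1.0
    return p

print(pronunciation_prob(["t", "o", "m", "ae", "t", "o"]))   # 0.73
print(pronunciation_prob(["t", "o", "m", "aa", "t", "o"]))   # 0.27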

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs Discriminative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched?

• Entity A and Entity B – Match(A,B)

– Two entities match iff their parts match: Match(Parts(A), Parts(B))

– Two entities match iff their properties match: Match(Properties(A), Properties(B))

• Heart of discriminative vs. generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books, Journals, Proceedings

Main Text(s): Natural Language Understanding, James Allen; Speech and Language Processing, Jurafsky and Martin; Foundations of Statistical NLP, Manning and Schutze

Other References: Statistical NLP, Charniak

Journals: Computational Linguistics, Natural Language Engineering, AI, AI Magazine, IEEE SMC

Conferences: ACL, EACL, COLING, MT Summit, EMNLP, IJCNLP, HLT, ICON, SIGIR, WWW, ICML, ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied Disciplines

Philosophy: Semantics, meaning of "meaning", Logic (syllogism)
Linguistics: Study of Syntax, Lexicon, Lexical Semantics etc.
Probability and Statistics: Corpus Linguistics, Testing of Hypotheses, System Evaluation
Cognitive Science: Computational Models of Language Processing, Language Acquisition
Psychology: Behavioristic insights into Language Processing, Psychological Models
Brain Science: Language Processing Areas in the Brain
Physics: Information Theory, Entropy, Random Fields
Computer Sc. & Engg.: Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (1/4)

Day-1: Introduction; NLP as playground for rule based and statistical techniques
Before break: Complete NLP architecture, Ambiguity, start of POS tagging
After break: NLTK (open source python based framework of comprehensive NLP tools), POS tagging assignment

Day-2: Shallow parsing
Before break: Morph analysis and synthesis (segmentation, inflection, declension, derivation etc.), Rule based vs. Statistical NLU comparison with POS tagging as case study, Hidden Markov Model and Viterbi algorithm
After break: POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (2/4)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (3/4)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (4/4)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summary

• Both Linguistics and Computation are needed.

• Linguistics is the eye, Computation the body.

• The cycle Phenomenon → Formalization → Technique → Experimentation → Evaluation → Hypothesis Testing has accorded to NLP the prestige it commands today.

• Natural-science-like approach.

• Neither Theory Building nor Data Driven Pattern Finding can be ignored.

21 July 2014Pushpak Bhattacharyya Intro

POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

[NLP Trinity figure — three axes: Problem (Morph Analysis, Part of Speech Tagging, Parsing, Semantics), Algorithm (HMM, MEMM, CRF), Language (Hindi, Marathi, English, French)]

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

“_“ The_DT mechanisms_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, ”_” said_VBD Mr._NNP Benton_NNP ._.
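For reference, a hedged NLTK example producing Penn Treebank tags like the ones above; it assumes NLTK is installed and that its tokenizer ('punkt') and default tagger ('averaged_perceptron_tagger') models have been downloaded:

# Tagging a sentence with NLTK's default POS tagger (Penn Treebank tag-set)
import nltk

sentence = "The mechanisms that make traditional hardware are really being obsoleted."
tokens = nltk.word_tokenize(sentence)        # needs the 'punkt' model
print(nltk.pos_tag(tokens))                  # needs the perceptron tagger model
# e.g. [('The', 'DT'), ('mechanisms', 'NNS'), ('that', 'WDT'), ('make', 'VBP'), ...]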

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increased Complexity of Processing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_J batsman_N Aravinda_De_Silva_N is_F a_F man_N of_F few_J words_N was_F very_R much_R evident_J on_F Wednesday_N when_F the_F legendary_J batsman_N ,_F who_F has_V always_R let_V his_N bat_N talk_V ,_F struggled_V to_F answer_V a_F barrage_N of_F questions_N at_F a_F function_N to_F promote_V the_F cricket_N league_N in_F the_F city_N ._F

N (noun), V (verb), J (adjective), R (adverb) and F (other, i.e. function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation:
That_F/N/J ('that' can be a complementizer (can be put under 'F'), a demonstrative (can be put under 'J') or a pronoun (can be put under 'N'))
former_J
Sri_N/J Lanka_N/J ('Sri Lanka' together qualifies the skipper)
skipper_N/V ('skipper' can be a verb too)
and_F
ace_J/N ('ace' can be both J and N: "Nadal served an ace")
batsman_N/J ('batsman' can be J as it qualifies Aravinda De Silva)
Aravinda_N De_N Silva_N is_F a_F
man_N/V ('man' can be a verb too, as in 'man the boat')
of_F few_J
words_N/V ('words' can be a verb too, as in 'he words his speeches beautifully')

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of "That"

That man is known by the company he keeps. (Demonstrative)

Man that is known by the company he keeps gets a good job. (Pronoun)

That man is known by the company he keeps is a proverb. (Complementizer)

Chaotic systems: systems where a small perturbation in input causes a large change in output.

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation (contd.):
was_F very_R much_R evident_J on_F Wednesday_N
when_F/N ('when' can be a relative pronoun (put under 'N') as in 'I know the time when he comes')
the_F legendary_J batsman_N who_F/N has_V always_R let_V his_N bat_N/V talk_V/N struggle_V/N answer_V/N barrage_N/V question_N/V function_N/V promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (1/2)

Best tag sequence = T* = argmax P(T|W) = argmax P(T)·P(W|T)   (by Bayes' theorem)

P(T) = P(t0=^, t1, t2, …, tn+1=.)
     = P(t0)·P(t1|t0)·P(t2|t1,t0)·P(t3|t2,t1,t0) … P(tn|tn-1,tn-2,…,t0)·P(tn+1|tn,tn-1,…,t0)
     = P(t0)·P(t1|t0)·P(t2|t1) … P(tn|tn-1)·P(tn+1|tn)
     = ∏_{i=0}^{n+1} P(ti|ti-1)   (Bigram Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0…tn+1)·P(w1|w0,t0…tn+1)·P(w2|w1,w0,t0…tn+1) … P(wn|w0…wn-1,t0…tn+1)·P(wn+1|w0…wn,t0…tn+1)

Assumption: a word is determined completely by its tag. This is inspired by speech recognition.

= P(w0|t0)·P(w1|t1) … P(wn+1|tn+1)

= ∏_{i=0}^{n+1} P(wi|ti)   (Lexical Probability Assumption)
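The two probability tables required by these assumptions can be estimated by simple counting over a tagged corpus. A minimal sketch with an invented two-sentence corpus (the sentences and counts are illustrative only):

# Estimating P(t_i | t_{i-1}) and P(w_i | t_i) by counting over a toy tagged corpus
from collections import Counter

corpus = [[("^", "^"), ("people", "N"), ("laugh", "V"), (".", ".")],
          [("^", "^"), ("people", "N"), ("jump", "V"), ("high", "A"), (".", ".")]]

tag_bigrams, tag_counts, word_tag = Counter(), Counter(), Counter()
for sent in corpus:
    tags = [t for _, t in sent]
    for w, t in sent:
        word_tag[(w, t)] += 1
        tag_counts[t] += 1
    tag_bigrams.update(zip(tags, tags[1:]))

def p_trans(t_prev, t):      # P(t_i | t_{i-1}), the bigram transition probability
    return tag_bigrams[(t_prev, t)] / tag_counts[t_prev]

def p_lex(w, t):             # P(w_i | t_i), the lexical probability
    return word_tag[(w, t)] / tag_counts[t]

print(p_trans("^", "N"), p_lex("people", "N"))   # 1.0 1.0 on this toy corpus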

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R ._.

[Figure: the tag sequence ^ N V R . drawn as a chain of states, each emitting its word — lexical probabilities on the emissions, bigram probabilities on the transitions]

This model is called a generative model. Here words are observed (generated) from tags acting as states. This is similar to an HMM.

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi – Unigram, Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

[Bar chart: Per POS Accuracy for Bigram Assumption — one bar per tag (AJ0, AJ0-NN1, AJ0-VVG, AJC, AT0, AV0-AJ0, …, VVN, VVN-VVD, VVZ-NN2); y-axis from 0 to 1.2]

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

            AJ0  AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN  AJC  AJS   AT0   AV0  AV0-AJ0  AVP
AJ0        2899       20       32        1        3        3    0    0    18    35       27    1
AJ0-AV0      31       18        2        0        0        0    0    0     0     1       15    0
AJ0-NN1     161        0      116        0        0        0    0    0     0     0        1    0
AJ0-VVD       7        0        0        0        0        0    0    0     0     0        0    0
AJ0-VVG       8        0        0        0        2        0    0    0     1     0        0    0
AJ0-VVN       8        0        0        3        0        2    0    0     1     0        0    0
AJC           2        0        0        0        0        0   69    0     0    11        0    0
AJS           6        0        0        0        0        0    0   38     0     2        0    0
AT0         192        0        0        0        0        0    0    0  7000    13        0    0
AV0         120        8        2        0        0        0   15    2    24  2444       29   11
AV0-AJ0      10        7        0        0        0        0    0    0     0    16       33    0
AVP          24        0        0        0        0        0    0    0     1    11        0  737
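Per-POS accuracy can be read off such a matrix as the diagonal count divided by the row total. A small sketch using the first three rows above:

# Per-POS accuracy from a confusion matrix: diagonal / row sum
confusion = {
    "AJ0":     [2899, 20, 32, 1, 3, 3, 0, 0, 18, 35, 27, 1],
    "AJ0-AV0": [31, 18, 2, 0, 0, 0, 0, 0, 0, 1, 15, 0],
    "AJ0-NN1": [161, 0, 116, 0, 0, 0, 0, 0, 0, 0, 1, 0],
}
tags = ["AJ0", "AJ0-AV0", "AJ0-NN1", "AJ0-VVD", "AJ0-VVG", "AJ0-VVN",
        "AJC", "AJS", "AT0", "AV0", "AV0-AJ0", "AVP"]

for tag, row in confusion.items():
    correct = row[tags.index(tag)]            # diagonal entry for this tag
    print(tag, round(correct / sum(row), 3))  # per-POS accuracy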

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

[NLP Trinity figure, repeated — three axes: Problem (Morph Analysis, Part of Speech Tagging, Parsing, Semantics), Algorithm (HMM, MEMM, CRF), Language (Hindi, Marathi, English, French)]

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1: Red = 30%, Green = 50%, Blue = 20%
Urn 2: Red = 10%, Green = 40%, Blue = 50%
Urn 3: Red = 60%, Green = 10%, Blue = 30%

Colored ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

Transition probability table:
       U1   U2   U3
U1    0.1  0.4  0.5
U2    0.6  0.2  0.2
U3    0.3  0.4  0.3

Emission probability table:
        R    G    B
U1    0.3  0.5  0.2
U2    0.1  0.4  0.5
U3    0.6  0.1  0.3

Given the observation RRGGBRGR, the state sequence is not so easily computable.

21 July 2014Pushpak Bhattacharyya Intro

POS 73
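The same urn model written down as matrices — a sketch whose numbers are exactly the two tables above:

# The urn HMM as transition (A) and emission (B) matrices
import numpy as np

states = ["U1", "U2", "U3"]
symbols = ["R", "G", "B"]

A = np.array([[0.1, 0.4, 0.5],      # P(next urn | current urn), rows = current urn
              [0.6, 0.2, 0.2],
              [0.3, 0.4, 0.3]])
B = np.array([[0.3, 0.5, 0.2],      # P(colour | urn), columns in order R, G, B
              [0.1, 0.4, 0.5],
              [0.6, 0.1, 0.3]])

# each row is a probability distribution
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)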

Diagrammatic representation (1/2): [state diagram over U1, U2, U3; arcs carry the transition probabilities above and each state carries its emission distribution over R, G, B]

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (2/2): [the same diagram with each arc labelled by the product transition probability × emission probability — values such as 0.24, 0.04, 0.12, 0.18, 0.03, 0.09, …]

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM:

1. Given the observation sequence, find the possible state sequence(s) — Viterbi algorithm

2. Given the observation sequence, find its probability — forward/backward algorithm

3. Given the observation sequence, find the HMM parameters — Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi

The "start" and "end" markers are important in a sequence. Subtrees get eliminated due to the Markov assumption.

POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), . (end) [start & end states]

Lexicon: people: N, V; laugh: N, V

Corpora for training:
^ w11_t11 w12_t12 w13_t13 …………… w1k_1_t1k_1 .
^ w21_t21 w22_t22 w23_t23 …………… w2k_2_t2k_2 .
^ wn1_tn1 wn2_tn2 wn3_tn3 …………… wnk_n_tnk_n .

Inference

[Partial sequence graph: ^ → N/V for 'people' → N/V for 'laugh' → .]

Transition probability table (this table will change from language to language due to language divergences):
       ^    N    V    O    .
^      0  0.6  0.2  0.2    0
N      0  0.1  0.4  0.3  0.2
V      0  0.3  0.1  0.3  0.3
O      0  0.3  0.2  0.3  0.2
.      1    0    0    0    0

Lexical Probability Table (size of this table = #POS tags in tagset × vocabulary size; vocabulary size = #unique words in the corpus):
       ε      people    laugh    …
^      1        0         0      0
N      0      1×10⁻³    1×10⁻⁵
V      0      1×10⁻⁶    1×10⁻³
O      0        0       1×10⁻⁹
.      1        0         0      0

Inference on a new sentence: ^ people laugh .

p(^ N N . | ^ people laugh .) = (0.6 × 0.1) × (0.1 × 1×10⁻³) × (0.2 × 1×10⁻⁵)
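A small sketch of this inference step, scoring two candidate tag sequences for "^ people laugh ." with the table values above:

# Scoring candidate tag sequences with the transition and lexical tables above
trans = {("^", "N"): 0.6, ("N", "N"): 0.1, ("N", "V"): 0.4, ("N", "."): 0.2,
         ("V", "N"): 0.3, ("V", "."): 0.3}
lex = {("people", "N"): 1e-3, ("people", "V"): 1e-6,
       ("laugh", "N"): 1e-5, ("laugh", "V"): 1e-3}

def joint(words, tags):
    # P(tags, words) = prod P(t_i | t_{i-1}) * P(w_i | t_i); '.' emits nothing here
    p, prev = 1.0, "^"
    for w, t in zip(words, tags):
        p *= trans[(prev, t)] * (lex[(w, t)] if t != "." else 1.0)
        prev = t
    return p

print(joint(["people", "laugh", "."], ["N", "N", "."]))   # sequence ^ N N .
print(joint(["people", "laugh", "."], ["N", "V", "."]))   # sequence ^ N V .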

Computational Complexity

If we have to get the probability of each sequence and then find the maximum among them, we would run into an exponential number of computations.

If |S| = #states (tags + ^ + .) and |O| = length of the sentence (#words + ^ + .), then #sequences = |S|^(|O|−2).

But a large number of partial computations can be reused using Dynamic Programming (see the brute-force sketch below for contrast).
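A brute-force sketch that makes the |S|^(|O|−2) blow-up concrete by enumerating every tag sequence with itertools; the flat probability tables are placeholders, not real estimates:

# Brute force: enumerate and score every tag sequence (what Viterbi's DP avoids)
from itertools import product

tags = ["N", "V", "O"]
words = ["people", "laugh"]
# flat toy tables: every transition 0.1, every emission 0.01 (illustration only)
trans = {(a, b): 0.1 for a in ["^"] + tags for b in tags + ["."]}
lex = {(w, t): 0.01 for w in words for t in tags}

def score(seq):
    p, prev = 1.0, "^"
    for w, t in zip(words, seq):
        p *= trans[(prev, t)] * lex[(w, t)]
        prev = t
    return p * trans[(prev, ".")]          # close the sequence with '.'

candidates = list(product(tags, repeat=len(words)))   # |S|^(|O|-2) sequences
print(len(candidates), max(candidates, key=score))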

Dynamic Programming (trellis for ^ people laugh .):

From ^ (on ε): N1 = 0.6 × 1.0 = 0.6, V2 = 0.2, O3 = 0.2

Expanding N1 on 'people':
→ N: 0.6 × 0.1 × 10⁻³ = 6 × 10⁻⁵
→ V: 0.6 × 0.4 × 10⁻³ = 2.4 × 10⁻⁴
→ O: 0.6 × 0.3 × 10⁻³ = 1.8 × 10⁻⁴
→ .: 0.6 × 0.2 × 10⁻³ = 1.2 × 10⁻⁴

No need to expand N4 and N5 because they will never be part of the winning sequence.

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.

Now complexity reduces from |S|^|O| to |S|·|O|.

Here we followed the Markov assumption of order 1.

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state. Keep advancing sequences that are "maximum" amongst all those ending in the same state.

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm — tree for the sentence "^ People laugh ."

[Tree: from ^ (on ε), branches N (0.6), V (0.2), O (0.2); after 'People', the surviving children carry 0.06×10⁻³, 0.24×10⁻³, 0.18×10⁻³ and, at the next level, 0.06×10⁻⁶, 0.02×10⁻⁶, 0.06×10⁻⁶, the rest being 0]

Claim: we do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass: will a word always be given the same tag? No. Consider the example:

^ people the city with soldiers . (i.e. 'populate')

^ quickly people the city .

In the first sentence "people" is most likely to be tagged as a noun, whereas in the second the probability mass will shift and "people" will be tagged as a verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon

Long tail phenomenon: probability is very low but not zero over a large observed sequence.

Language phenomenon: "people", which is predominantly tagged as "Noun", displays a long tail phenomenon; "laugh" is predominantly tagged as "Verb".

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 (6×10⁻⁵)          N2 (6×10⁻⁸)
N  V  O              N  V  O

LAUGH

Next step: all the probabilities will be multiplied by identical probabilities (lexical and transition). So the children of N2 will have probability less than the children of N1.

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B) = P(B|A) if P(A) = P(B)

P(A|B) means: Causality (B causes A); Sequentiality (A follows B)

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = {U1, U2, U3}, V = {R, G, B}

For observation O = o1 … on and state sequence Q = q1 … qn:

π_i = P(q1 = Ui)

A (transition probabilities):
       U1   U2   U3
U1    0.1  0.4  0.5
U2    0.6  0.2  0.2
U3    0.3  0.4  0.3

B (emission probabilities):
        R    G    B
U1    0.3  0.5  0.2
U2    0.1  0.4  0.5
U3    0.6  0.1  0.3

92

Observations and states

Obs:    R  R  G  G  B  R  G  R    (O1 … O8)
State:  S1 S2 S3 S4 S5 S6 S7 S8

Si ∈ {U1, U2, U3}: a particular state; S: state sequence; O: observation sequence; S* = "best" possible state (urn) sequence.
Goal: maximize P(S*|O) by choosing the "best" S.

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O), where S is the state sequence and O is the observation sequence:

S* = argmax_S P(S|O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

P(S|O) = P(S1–8 | O1–8)
       = P(S1|O)·P(S2|S1,O)·P(S3|S2,S1,O) … P(S8|S7…S1,O)

By the Markov assumption (a state depends only on the previous state):
P(S|O) = P(S1|O)·P(S2|S1,O)·P(S3|S2,O) … P(S8|S7,O)

Obs:    R  R  G  G  B  R  G  R    (O1 … O8)
State:  S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) = P(A)·P(B|A) / P(B)

P(A): prior; P(B|A): likelihood

argmax_S P(S|O) = argmax_S P(S)·P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

P(S) = P(S1–8) = P(S1)·P(S2|S1)·P(S3|S2,S1)·P(S4|S3,S2,S1) … P(S8|S7…S1)

By the Markov assumption (k=1):

P(S) = P(S1)·P(S2|S1)·P(S3|S2)·P(S4|S3) … P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

P(O|S) = P(O1|S1–8)·P(O2|O1,S1–8)·P(O3|O2,O1,S1–8) … P(O8|O1–7,S1–8)

Assumption: the ball drawn depends only on the urn chosen:

P(O|S) = P(O1|S1)·P(O2|S2)·P(O3|S3) … P(O8|S8)

P(S|O) ∝ P(S)·P(O|S)
       = P(S1)·P(S2|S1)·P(S3|S2)·P(S4|S3) … P(S8|S7) · P(O1|S1)·P(O2|S2)·P(O3|S3) … P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)·P(O|S) = [P(O0|S0)·P(S1|S0)] · [P(O1|S1)·P(S2|S1)] · [P(O2|S2)·P(S3|S2)] · [P(O3|S3)·P(S4|S3)] · [P(O4|S4)·P(S5|S4)] · [P(O5|S5)·P(S6|S5)] · [P(O6|S6)·P(S7|S6)] · [P(O7|S7)·P(S8|S7)] · [P(O8|S8)·P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e. P(S9|S8) = 1. O0 is the ε-transition.

Obs:    ε  R  R  G  G  B  R  G  R
State:  S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

[Figure: chain S0 → S1 → S2 → … → S8 → S9, with the arcs labelled ε, R, R, G, G, B, R, G, R]

Obs:    ε  R  R  G  G  B  R  G  R
State:  S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

Notation: P(Ok|Sk)·P(Sk+1|Sk) = P(Sk → Sk+1) with output Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

[Figure: probabilistic FSM with two states S1 and S2; the arcs are labelled with (symbol, probability) pairs — (a1, 0.3), (a2, 0.4), (a1, 0.2), (a2, 0.3), (a1, 0.1), (a2, 0.2), (a1, 0.3), (a2, 0.2)]

The question here is: "what is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree

[Tree, starting at ε with S1 = 1.0, S2 = 0.0]

After a1: S1→S1: 1.0×0.1 = 0.1; S1→S2: 1.0×0.3 = 0.3; S2→S1: 0.0×0.2 = 0.0; S2→S2: 0.0×0.3 = 0.0

After a2: 0.1×0.2 = 0.02; 0.1×0.4 = 0.04; 0.3×0.3 = 0.09; 0.3×0.2 = 0.06

Choose the winning sequence per state, per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd…

After the third symbol (a1), expanding the per-state winners 0.09 (S1) and 0.06 (S2):
0.09×0.1 = 0.009; 0.09×0.3 = 0.027; 0.06×0.2 = 0.012; 0.06×0.3 = 0.018

After the fourth symbol (a2):
S1: 0.027×0.3 = 0.0081; S2: 0.027×0.2 = 0.0054; S2: 0.012×0.4 = 0.0048; S1: 0.012×0.2 = 0.0024

The problem being addressed by this tree is S* = argmax_S P(S | a1-a2-a1-a2, µ), where a1-a2-a1-a2 is the output sequence and µ the model (the machine).

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward): S1 → S2 → S1 → S2 → S1, over the output sequence a1, a2, a1, a2.

Problem statement: find the best possible sequence S* = argmax_S P(S | O, µ), where S = state sequence, O = output sequence, and µ = machine or model.

Machine or model µ = (S, A, T, S0): state collection S, alphabet set A, transitions T, start symbol S0.

T is defined as P(Si --a_k--> Sj).

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

         ε      a1                                  a2              a1                a2
S1      1.0    (1.0×0.1, 0.0×0.2) = (0.1, 0.0)    (0.02, 0.09)    (0.009, 0.012)    (0.0024, 0.0081)
S2      0.0    (1.0×0.3, 0.0×0.3) = (0.3, 0.0)    (0.04, 0.06)    (0.027, 0.018)    (0.0048, 0.0054)

Columns: latest symbol observed; rows: ending state.

Note: every cell records the winning probability ending in that state. The bold-faced value in each cell is the sequence probability ending in that state. Going backward from the final winner sequence, which ends in state S2 (indicated by the 2nd tuple), we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm (following James Allen, Natural Language Understanding (2nd edition), Benjamin/Cummings (pub.), 1995)

Given:
1. The HMM, which means:
   a. Start state S1
   b. Alphabet A = {a1, a2, …, ap}
   c. Set of states S = {S1, S2, …, Sn}
   d. Transition probability P(Si --a_k--> Sj), which is equal to P(Sj, ak | Si)
2. The output string a1 a2 … aT

To find: the most likely sequence of states C1 C2 … CT which produces the given output sequence, i.e.

C1 C2 … CT = argmax_C [P(C | a1 a2 … aT)]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contd…

Data structures:
1. An N×T array called SEQSCORE to always maintain the winner sequence (N = #states, T = length of the output sequence)
2. Another N×T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For i = 2 to N do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For t = 2 to T do
    For i = 1 to N do
        SEQSCORE(i,t) = max_{j=1..N} [SEQSCORE(j, t-1) × P(Sj --a_k--> Si)]
        BACKPTR(i,t) = the index j that gives the max above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = i that maximizes SEQSCORE(i,T)
For i from (T-1) to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1×T
2. SEQSCORE can be T×2

Homework: compare this with A* and Beam Search. Reason for this comparison: both of them work for finding and recovering sequences. A small implementation sketch of the three steps follows.
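A compact sketch of the three steps above using SEQSCORE and BACKPTR arrays, run on the two-state probabilistic FSM of the preceding slides (the arc probabilities are read off the tree-development figure, so treat them as illustrative):

# Viterbi with SEQSCORE / BACKPTR, on the two-state machine of the earlier slides
import numpy as np

states = ["S1", "S2"]                      # S1 is the start state
P = {("S1", "a1", "S1"): 0.1, ("S1", "a1", "S2"): 0.3,
     ("S1", "a2", "S1"): 0.2, ("S1", "a2", "S2"): 0.4,
     ("S2", "a1", "S1"): 0.2, ("S2", "a1", "S2"): 0.3,
     ("S2", "a2", "S1"): 0.3, ("S2", "a2", "S2"): 0.2}   # P(Si --a_k--> Sj)

def viterbi(output):
    N, T = len(states), len(output)
    SEQSCORE = np.zeros((N, T + 1))
    BACKPTR = np.zeros((N, T + 1), dtype=int)
    SEQSCORE[0, 0] = 1.0                                 # initialization
    for t, sym in enumerate(output, start=1):            # iteration
        for i, si in enumerate(states):
            scores = [SEQSCORE[j, t - 1] * P[(sj, sym, si)]
                      for j, sj in enumerate(states)]
            SEQSCORE[i, t] = max(scores)
            BACKPTR[i, t] = int(np.argmax(scores))
    C = [int(np.argmax(SEQSCORE[:, T]))]                 # sequence identification
    for t in range(T, 0, -1):
        C.append(BACKPTR[C[-1], t])
    return SEQSCORE[C[0], T], [states[i] for i in reversed(C)]

print(viterbi(["a1", "a2", "a1", "a2"]))   # (0.0081, ['S1', 'S2', 'S1', 'S2', 'S1'])

On this arc table the recovered path matches the "path found" slide: S1 → S2 → S1 → S2 → S1 with probability 0.0081.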

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

[Trellis trace: from S0 on ε, the initial scores over the three urns are 0.5, 0.3, 0.2; after reading R they expand through values such as 0.03, 0.08, 0.15, 0.06, 0.02, 0.18, 0.24, and at the next level 0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036, …]

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of order > 1 (say 2)

The same theory works:
P(S)·P(O|S) = P(O0|S0)·P(S1|S0) · [P(O1|S1)·P(S2|S1,S0)] · [P(O2|S2)·P(S3|S2,S1)] · [P(O3|S3)·P(S4|S3,S2)] · [P(O4|S4)·P(S5|S4,S3)] · [P(O5|S5)·P(S6|S5,S4)] · [P(O6|S6)·P(S7|S6,S5)] · [P(O7|S7)·P(S8|S7,S6)] · [P(O8|S8)·P(S9|S8,S7)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e. P(S9|S8,S7) = 1. O0 is the ε-transition.

Obs:    ε  R  R  G  G  B  R  G  R
State:  S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of the observation sequence? The language modeling problem.

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   • Less probable because of a grammatical mistake
3. P("The svn rises in the east")
   • Less probable because of a lexical mistake
4. P("The sun rises in the west")
   • Less probable because of a semantic mistake

Probabilities are computed in the context of corpora.

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of a language model:

1. Detect well-formedness (lexical, syntactic, semantic, pragmatic, discourse) — see the toy sketch below

2. Language identification: given a piece of text, what language does it belong to?
   Good morning — English; Guten Morgen — German; Bonjour — French

3. Automatic speech recognition

4. Machine translation
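A toy bigram language model makes use (1) concrete; the training text and the add-1 smoothing below are purely illustrative:

# Toy bigram LM: the grammatical sentence should score higher
from collections import Counter

training = "the sun rises in the east . the moon rises in the west .".split()
bigrams = Counter(zip(training, training[1:]))
unigrams = Counter(training)

def p_sentence(sentence):
    words = sentence.lower().split()
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        # add-1 smoothing so unseen bigrams get a small, non-zero probability
        p *= (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(unigrams))
    return p

print(p_sentence("the sun rises in the east"))
print(p_sentence("the sun rise in the east"))   # lower: 'sun rise' never seen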

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0, o1, o2, o3, …, om)?

P(O) = Σ_S P(O, S)   (marginalization)

Consider the observation sequence O = O0 O1 O2 … Om over state sequences S = S0 S1 S2 S3 … Sm Sm+1, where the Si's represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0, o1, o2, o3, …, om)

P(O) = Σ_S P(O, S) = Σ_S P(S)·P(O|S)

P(S0 S1 … Sm+1)·P(O0 O1 … Om | S0 S1 … Sm+1)
  = P(S0)·P(S1|S0)·P(S2|S1) … P(Sm+1|Sm) · P(O0|S0)·P(O1|S1) … P(Om|Sm)
  = P(S0) · [P(O0|S0)·P(S1|S0)] · [P(O1|S1)·P(S2|S1)] … [P(Om|Sm)·P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k,i)

Define F(k,i) = probability of being in state Si having seen o0 o1 o2 … ok:
F(k,i) = P(o0, o1, o2, …, ok, Si)

With m as the length of the observed sequence and N states:
P(observed sequence) = P(o0 o1 o2 … om) = Σ_{p=0..N} P(o0 o1 o2 … om, Sp) = Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd.)

F(k,q) = P(o0 o1 o2 … ok, Sq)
       = P(o0 o1 o2 … ok-1, ok, Sq)
       = Σ_{p=0..N} P(o0 o1 o2 … ok-1, Sp, ok, Sq)
       = Σ_{p=0..N} P(o0 o1 o2 … ok-1, Sp) · P(ok, Sq | o0 o1 o2 … ok-1, Sp)
       = Σ_{p=0..N} F(k-1, p) · P(ok, Sq | Sp)
       = Σ_{p=0..N} F(k-1, p) · P(Sp → Sq) with output ok

Obs:    O0 O1 O2 O3 … Ok Ok+1 … Om-1 Om
State:  S0 S1 S2 S3 … Sp  Sq  … Sm  Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(k,i)

Define B(k,i) = probability of seeing ok ok+1 ok+2 … om given that the state was Si:
B(k,i) = P(ok, ok+1, ok+2, …, om | Si)

With m as the length of the whole observed sequence:
P(observed sequence) = P(o0 o1 o2 … om) = P(o0 o1 o2 … om | S0) = B(0,0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd.)

B(k,p) = P(ok, ok+1, ok+2, …, om | Sp)
       = P(ok+1, ok+2, …, om, ok | Sp)
       = Σ_{q=0..N} P(ok+1, ok+2, …, om, ok, Sq | Sp)
       = Σ_{q=0..N} P(ok, Sq | Sp) · P(ok+1, ok+2, …, om | ok, Sq, Sp)
       = Σ_{q=0..N} P(ok+1, ok+2, …, om | Sq) · P(ok, Sq | Sp)
       = Σ_{q=0..N} B(k+1, q) · P(Sp → Sq) with output ok

Obs:    O0 O1 O2 O3 … Ok Ok+1 … Om-1 Om
State:  S0 S1 S2 S3 … Sp  Sq  … Sm  Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of forward probability: to find P(O), the probability of the observation sequence.

E.g. ^ People laugh .

[Trellis: ^, then N/V for 'People', then N/V for 'Laugh', then .]

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probabilities:
       ^    N    V    .
^      0  0.7  0.3    0
N      0  0.2  0.6  0.2
V      0  0.6  0.2  0.2
.      1    0    0    0

Lexical probabilities:
       ε   People  Laugh
^      1     0      0
N      0    0.8    0.2
V      0    0.1    0.9
.      1     0      0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all state sequences S of ∏_i P(Oi | Si) · P(Si → Si+1)

Computation along the various paths of the tree (ε, People, Laugh):

Path 1: ^ N N .   P(Path1) = (1.0×0.7) × (0.8×0.2) × (0.2×0.2)
Path 2: ^ N V .   P(Path2) = (1.0×0.7) × (0.8×0.6) × (0.9×0.2)
Path 3: ^ V N .   P(Path3) = (1.0×0.3) × (0.1×0.6) × (0.2×0.2)
Path 4: ^ V V .   P(Path4) = (1.0×0.3) × (0.1×0.2) × (0.9×0.2)

[Tree: ^ branches to N and V on ε; each branches to N and V on 'People'; each ends in . on 'Laugh']

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis

F = accumulated F × output probability × transition probability

F(N₁) = 0.7 × 1.0
F(V₁) = 0.3 × 1.0
F(N₂) = F(N₁) × (0.2 × 0.8) + F(V₁) × (0.6 × 0.1)
F(V₂) = F(N₁) × (0.6 × 0.8) + F(V₁) × (0.2 × 0.1)
F(.)  = F(N₂) × (0.2 × 0.2) + F(V₂) × (0.2 × 0.9)

[Trellis: ^ → {N₁, V₁} on ε → {N₂, V₂} on People → . on Laugh]

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of Multiplications

Tree: each path has 5 multiplications + 1 addition; there are 4 paths in the tree; therefore a total of 20 multiplications and 3 additions.

Trellis: F(N₁) → 1 multiplication; F(V₁) → 1 multiplication; F(N₂) = F(N₁) × (1 mult) + F(V₁) × (1 mult) → 4 multiplications + 1 addition; similarly 4 multiplications and 1 addition each for F(V₂) and F(.).

So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = #states and |O| = observation length (including ^ and .).

Stage 1 of the trellis: |S| multiplications.
Stage 2 of the trellis: |S| nodes; each node needs computation over |S| arcs; each arc = 1 multiplication; accumulating F = 1 more multiplication. Total: 2|S|² multiplications.
The same holds for each stage before reading '.'. At the final stage ('.'): 2|S| multiplications.

Therefore total multiplications = |S| + 2|S|²(|O| − 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values multiplied by P(Si → Sj).
3. Complexity = |S| + 2|S|²(|O| − 1) + 2|S| = 2|S|²|O| − 2|S|² + 3|S| = O(|S|²·|O|),

i.e. linear in the length of the input and quadratic in the number of states.
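A minimal forward-algorithm sketch for the "^ People laugh ." trellis, using the transition and lexical tables given earlier (the emission of the current observation is taken at the source state, as in the trellis computation above); its output equals the sum of the four path probabilities:

# Forward algorithm on the '^ People laugh .' trellis
trans = {("^", "N"): 0.7, ("^", "V"): 0.3, ("N", "N"): 0.2, ("N", "V"): 0.6,
         ("N", "."): 0.2, ("V", "N"): 0.6, ("V", "V"): 0.2, ("V", "."): 0.2}
lex = {("^", "eps"): 1.0, ("N", "people"): 0.8, ("N", "laugh"): 0.2,
       ("V", "people"): 0.1, ("V", "laugh"): 0.9}

def forward(steps):
    # steps: (symbol emitted while leaving each state, set of next states)
    F = {"^": 1.0}
    for obs, next_states in steps:
        newF = {}
        for nxt in next_states:
            # accumulated F x output probability x transition probability
            newF[nxt] = sum(F[s] * lex[(s, obs)] * trans[(s, nxt)] for s in F)
        F = newF
    return sum(F.values())

steps = [("eps", ["N", "V"]), ("people", ["N", "V"]), ("laugh", ["."])]
print(forward(steps))   # P(O): equals P(Path1)+P(Path2)+P(Path3)+P(Path4) above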

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward Probability
   a) Derive the Backward Algorithm
   b) Compute its complexity

2. Express P(O) in terms of both the forward and the backward probability.

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble: auto-completion of words (human vs. machine)

Humour detection using WordNet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131


Garden pathing

The only minus possibly was the need to face the audience more and more insightful question answer

The old man the boat

The horse raced past the garden fell

21 July 2014Pushpak Bhattacharyya Intro

POS 27

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 28

Semantic Analysis Representation in terms of

Predicate calculusSemantic NetsFramesConceptual Dependencies and Scripts

John gave a book to Mary Give action Agent John Object Book Recipient

Mary Challenge ambiguity in semantic role labeling

(Eng) Visiting aunts can be a nuisance (Hin) aapko mujhe mithaai khilaanii padegii

(ambiguous in Marathi and Bengali too not in Dravidian languages)

21 July 2014Pushpak Bhattacharyya Intro

POS 29

Coreference challenge

Binding of co-referring nouns and pronouns The monkey ate the banana because it

was hungry The monkey ate the banana because it

was ripe and sweet The monkey ate the banana because it

was lunch time

21 July 2014Pushpak Bhattacharyya Intro

POS 30

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 31

Pragmatics Very hard problem Model user intention

Tourist (in a hurry checking out of the hotel motioning to the service boy) Boy go upstairs and see if my sandals are under the divan Do not be late I just have 15 minutes to catch the train

Boy (running upstairs and coming back panting) yes sir they are there

World knowledge WHY INDIA NEEDS A SECOND OCTOBER (ToI

21007)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

DiscourseProcessing of sequence of sentences Mother to John

John go to school It is open today Should you bunk Father will be very angry

Ambiguity of openbunk whatWhy will the father be angry

Complex chain of reasoning and application of world knowledge Ambiguity of father

father as parent or

father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected ndash today was the math test

He couldnrsquot control the class

Teacher shouldnrsquot have made him responsible

After all he is just a janitor

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (12)1 Teacher (angrily) did you miss the class yesterday

Student not much

2 A man coming back to his parked car sees the sticker Parking fine He goes and thanks the policeman for appreciating his parking skill

3 John I got a Jaguar car for my unemployed youngest sonJack Thats a great exchange

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (22)A teacher-student exchangeTeacher What do you think is the capital of Ethiopia

Student What do you think

Teacher (angrily) I do not think ltpausegt I know

Student I do not think I know ltno pausegt21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps Segmentation (Word Boundary Detection) Identify the word

Isolated Word Recognition Identify W given SS (speech signal)

^arg max ( | )

WW P W SS

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood called ldquophonological model ldquo intuitively more tractable

P(W) = prior probability called ldquolanguage modelrdquo

^arg m ax ( | )

arg m ax ( ) ( | )W

W

W P W SS

P W P SS W

W appears in the corpus( ) words in the corpus

P W

21 July 2014Pushpak Bhattacharyya Intro

POS 39

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound Text ambiguity whether vs weather right vs write bought vs bot

Text Sound ambiguity read (present tense) vs read (past tense) lead (verb) vs lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W)= P(t o m ae t o |Word is ldquotomatordquo) = Product of arc probabilities

t

s4

o m o

ae

t

aa

end

s1 s2 s3s5

s6 s7

10 10 10 1010

10

073

027

Word

Pronunciation Automaton

Tomato

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs Discrimnative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

bull Entity A and Entity BndashMatch(AB)

ndashTwo entities match iff their parts matchbull Match(Parts(A) Parts(B))

ndashTwo entities match iff their properties matchbull Match(Properties(A) Properties(B))

bull Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding James Allan Speech and NLP Jurafsky and Martin Foundations of Statistical NLP Manning and Schutze

Other References Statistical NLP Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied DisciplinesPhilosophy Semantics Meaning of ldquomeaningrdquo Logic

(syllogism)Linguistics Study of Syntax Lexicon Lexical Semantics etc

Probability and Statistics Corpus Linguistics Testing of Hypotheses System Evaluation

Cognitive Science Computational Models of Language Processing Language Acquisition

Psychology Behavioristic insights into Language Processing Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory Entropy Random Fields

Computer Sc amp Engg Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

infection declension derivation etc ) Rule based Vs Statistical NLU comparison with POS tagging as case study Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

ldquo_ldquo The_DT mechanisms_NNS that_WDTmake_VBP traditional_JJ hardware_NNare_VBP really_RB being_VBGobsoleted_VBN by_IN microprocessor-based_JJ machines_NNS _ rdquo_rdquo said_VBDMr_NNP Benton_NNP _

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd.)

F(k, q) = P(o0 o1 o2 ... ok, Sq)
        = P(o0 o1 o2 ... ok-1, ok, Sq)
        = Σp=0..N P(o0 o1 o2 ... ok-1, Sp, ok, Sq)
        = Σp=0..N P(o0 o1 o2 ... ok-1, Sp) P(ok, Sq | o0 o1 o2 ... ok-1, Sp)
        = Σp=0..N F(k-1, p) P(ok, Sq | Sp)
        = Σp=0..N F(k-1, p) P(Sp --ok--> Sq)

Obs:   O0 O1 O2 O3 ... Ok Ok+1 ... Om-1 Om
State: S0 S1 S2 S3 ... Sp Sq   ... Sm   Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(k,i)

Define B(k,i) = probability of seeing ok ok+1 ok+2 ... om given that the state was Si:
B(k,i) = P(ok ok+1 ok+2 ... om | Si)

With m as the length of the whole observed sequence:
P(observed sequence) = P(o0 o1 o2 ... om) = P(o0 o1 o2 ... om | S0) = B(0, 0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd.)

B(k, p) = P(ok ok+1 ok+2 ... om | Sp)
        = P(ok+1 ok+2 ... om, ok | Sp)
        = Σq=0..N P(ok+1 ok+2 ... om, ok, Sq | Sp)
        = Σq=0..N P(ok, Sq | Sp) P(ok+1 ok+2 ... om | ok, Sq, Sp)
        = Σq=0..N P(ok+1 ok+2 ... om | Sq) P(ok, Sq | Sp)
        = Σq=0..N B(k+1, q) P(Sp --ok--> Sq)

Obs:   O0 O1 O2 O3 ... Ok Ok+1 ... Om-1 Om
State: S0 S1 S2 S3 ... Sp Sq   ... Sm   Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of forward probability: to find P(O), the probability of the observation sequence.

E.g. ^ People laugh .

[Trellis: ^, then N/V for 'People', then N/V for 'laugh', then '.']

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probabilities:
        ^     N     V     .
^       0    0.7   0.3    0
N       0    0.2   0.6   0.2
V       0    0.6   0.2   0.2
.       1     0     0     0

Lexical (emission) probabilities:
        ε    People   Laugh
^       1      0        0
N       0     0.8      0.2
V       0     0.1      0.9
.       1      0        0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ_S Π_j P(Oj | Si) P(Sj | Si) = Σ_S Π_j P(Si --Oj--> Sj)

(the sum, over every path through the states, of the product of emission and transition probabilities along the path)

Computation in the various paths of the tree (observations ε, People, Laugh along each path):

Path 1: ^ N N .    P(Path1) = (1.0 x 0.7) x (0.8 x 0.2) x (0.2 x 0.2)
Path 2: ^ N V .    P(Path2) = (1.0 x 0.7) x (0.8 x 0.6) x (0.9 x 0.2)
Path 3: ^ V N .    P(Path3) = (1.0 x 0.3) x (0.1 x 0.6) x (0.2 x 0.2)
Path 4: ^ V V .    P(Path4) = (1.0 x 0.3) x (0.1 x 0.2) x (0.9 x 0.2)

[Tree figure: root ^ branching to N and V on ε/People, each branching again to N and V on Laugh.]
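The four path probabilities can be reproduced mechanically from the two tables above; a small sketch (the dictionaries simply transcribe the transition and lexical tables given earlier, with 'e' standing for the ε output of ^):

```python
trans = {'^': {'N': 0.7, 'V': 0.3},
         'N': {'N': 0.2, 'V': 0.6, '.': 0.2},
         'V': {'N': 0.6, 'V': 0.2, '.': 0.2}}
emit = {'^': {'e': 1.0},
        'N': {'People': 0.8, 'Laugh': 0.2},
        'V': {'People': 0.1, 'Laugh': 0.9}}

obs = ['e', 'People', 'Laugh']
paths = [['^', 'N', 'N', '.'], ['^', 'N', 'V', '.'],
         ['^', 'V', 'N', '.'], ['^', 'V', 'V', '.']]

total = 0.0
for path in paths:
    p = 1.0
    for t in range(3):
        # each factor = output probability at the current state x transition out of it
        p *= emit[path[t]][obs[t]] * trans[path[t]][path[t + 1]]
    print(path, p)
    total += p
print("P(O) =", total)   # 0.00448 + 0.06048 + 0.00072 + 0.00108 = 0.06676
```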

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis

F = accumulated F x output probability x transition probability

F1(N) = 0.7 x 1.0
F1(V) = 0.3 x 1.0
F2(N) = F1(N) x (0.2 x 0.8) + F1(V) x (0.6 x 0.1)
F2(V) = F1(N) x (0.6 x 0.8) + F1(V) x (0.2 x 0.1)
F3(.) = F2(N) x (0.2 x 0.2) + F2(V) x (0.2 x 0.9)

[Trellis figure: ^ on ε, then N and V for 'People', then N and V for 'Laugh', ending in '.']

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of Multiplications

Tree: each path has 5 multiplications + 1 addition; there are 4 paths in the tree; therefore a total of 20 multiplications and 3 additions.

Trellis: F1(N) -> 1 multiplication; F1(V) -> 1 multiplication; F2(N) = F1(N) x (1 mult) + F1(V) x (1 mult) = 4 multiplications + 1 addition. Similarly for F2(V) and F3: 4 multiplications and 1 addition each.

So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = number of states, and |O| = observation length (including '^' and '.').
- Stage 1 of the trellis: |S| multiplications.
- Stage 2 of the trellis: |S| nodes; each node needs computation over |S| arcs; each arc = 1 multiplication; accumulated F = 1 more multiplication; total 2|S|^2 multiplications.
- The same holds for each stage before reading '.'.
- At the final stage ('.'): 2|S| multiplications.

Therefore total multiplications = |S| + 2|S|^2 (|O| - 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values multiplied by P(Sp --o--> Sq).
3. Complexity = |S| + 2|S|^2 (|O| - 1) + 2|S|
              = 2|S|^2 |O| - 2|S|^2 + 3|S|
              = O(|S|^2 |O|),
   i.e. linear in the length of the input and quadratic in the number of states.
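A minimal forward-algorithm sketch matching this summary, written for the same "^ People laugh ." example (the trans/emit dictionaries repeat the tables from the earlier slide); it accumulates F stage by stage and reproduces the P(O) = 0.06676 obtained by summing over the tree paths:

```python
def forward(obs, trans, emit, start='^', end='.'):
    """Return P(O): the probability mass reaching the end state after all observations.
    trans[s][s2] = P(s2 | s), emit[s][o] = P(o | s)."""
    F = {start: 1.0}
    for o in obs:
        new_F = {}
        for s, f in F.items():
            for s2, p_tr in trans.get(s, {}).items():
                # accumulated F x output probability x transition probability
                new_F[s2] = new_F.get(s2, 0.0) + f * emit[s].get(o, 0.0) * p_tr
        F = new_F
    return F.get(end, 0.0)

trans = {'^': {'N': 0.7, 'V': 0.3},
         'N': {'N': 0.2, 'V': 0.6, '.': 0.2},
         'V': {'N': 0.6, 'V': 0.2, '.': 0.2}}
emit = {'^': {'e': 1.0},
        'N': {'People': 0.8, 'Laugh': 0.2},
        'V': {'People': 0.1, 'Laugh': 0.9}}

print(forward(['e', 'People', 'Laugh'], trans, emit))   # 0.06676
```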

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward probability:
   a) Derive the backward algorithm.
   b) Compute its complexity.
2. Express P(O) in terms of both forward and backward probabilities.

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Structural Ambiguity

Scope:
1. The old men and women were taken to safe locations.
   (old men and women) vs. ((old men) and women)
2. No smoking areas will allow hookas inside.

Preposition phrase attachment:
- I saw the boy with a telescope. (who has the telescope?)
- I saw the mountain with a telescope. (world knowledge: a mountain cannot be an instrument of seeing)

A very ubiquitous case, a newspaper headline: "20 years later, BMC pays father 20 lakhs for causing son's death."

21 July 2014Pushpak Bhattacharyya Intro

POS 26

Garden pathing

The only minus possibly was the need to face the audience more and more insightful question answer

The old man the boat

The horse raced past the garden fell

21 July 2014Pushpak Bhattacharyya Intro

POS 27

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increasing complexity of processing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 28

Semantic Analysis

Representation in terms of:
- Predicate calculus
- Semantic nets
- Frames
- Conceptual dependencies and scripts

"John gave a book to Mary": Give (action); Agent: John; Object: book; Recipient: Mary.

Challenge: ambiguity in semantic role labeling
(Eng) Visiting aunts can be a nuisance.
(Hin) aapko mujhe mithaai khilaanii padegii
(ambiguous in Marathi and Bengali too, not in Dravidian languages)

21 July 2014Pushpak Bhattacharyya Intro

POS 29

Coreference challenge

Binding of co-referring nouns and pronouns The monkey ate the banana because it

was hungry The monkey ate the banana because it

was ripe and sweet The monkey ate the banana because it

was lunch time

21 July 2014Pushpak Bhattacharyya Intro

POS 30

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increasing complexity of processing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 31

Pragmatics Very hard problem Model user intention

Tourist (in a hurry checking out of the hotel motioning to the service boy) Boy go upstairs and see if my sandals are under the divan Do not be late I just have 15 minutes to catch the train

Boy (running upstairs and coming back panting) yes sir they are there

World knowledge WHY INDIA NEEDS A SECOND OCTOBER (ToI

21007)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increasing complexity of processing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

Discourse: processing of a sequence of sentences.

Mother to John: "John, go to school. It is open today. Should you bunk? Father will be very angry."

- Ambiguity of open/bunk what?
- Why will the father be angry? (complex chain of reasoning and application of world knowledge)
- Ambiguity of father: father as parent, or father as headmaster.

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected ndash today was the math test

He couldnrsquot control the class

Teacher shouldnrsquot have made him responsible

After all he is just a janitor

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (12)1 Teacher (angrily) did you miss the class yesterday

Student not much

2 A man coming back to his parked car sees the sticker Parking fine He goes and thanks the policeman for appreciating his parking skill

3 John I got a Jaguar car for my unemployed youngest sonJack Thats a great exchange

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (22)A teacher-student exchangeTeacher What do you think is the capital of Ethiopia

Student What do you think

Teacher (angrily) I do not think ltpausegt I know

Student I do not think I know ltno pausegt21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of the Noisy Channel Model: Probabilistic Speech Recognition (Isolated Word) [8]

Problem definition: given a sequence of speech signals, identify the words.

Two steps:
1. Segmentation (word boundary detection)
2. Identify the word

Isolated word recognition: identify W given SS (the speech signal):

W* = argmax_W P(W | SS)

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS | W) = likelihood, called the "phonological model"; intuitively more tractable.
P(W) = prior probability, called the "language model".

W* = argmax_W P(W | SS) = argmax_W P(W) P(SS | W)

P(W) = #(W appears in the corpus) / #(words in the corpus)
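The prior P(W) is just a relative frequency; a tiny sketch with a hypothetical token list (the corpus is invented for illustration):

```python
from collections import Counter

tokens = "the sun rises in the east and the sun sets in the west".split()
counts = Counter(tokens)

def p_word(w):
    """P(W) = #(W appears in the corpus) / #(words in the corpus)."""
    return counts[w] / len(tokens)

print(p_word("the"))   # 4/13
```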

21 July 2014Pushpak Bhattacharyya Intro

POS 39

Ambiguities in the context of P(SS|W) or P(W|SS): concerns

Sound -> Text ambiguity: whether vs. weather; right vs. write; bought vs. bot
Text -> Sound ambiguity: read (present tense) vs. read (past tense); lead (verb) vs. lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb), Advice (noun):
ad-vise / ad-vice

A syllable consists of:
1. Nucleus
2. Onset
3. Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS | W) = P(t o m ae t o | word is "tomato") = product of arc probabilities.

[Pronunciation automaton for "Tomato": states s1 ... s7 and end; arcs t, o, m each with probability 1.0; after s4 the automaton branches to 'ae' with 0.73 or 'aa' with 0.27, then rejoins through t, o (probability 1.0 each) to end.]

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs. Discriminative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched?

Entity A and Entity B: Match(A, B)
- Two entities match iff their parts match: Match(Parts(A), Parts(B))
- Two entities match iff their properties match: Match(Properties(A), Properties(B))

This is the heart of discriminative vs. generative scoring.

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books, Journals, Proceedings

Main text(s):
- Natural Language Understanding, James Allen
- Speech and Language Processing, Jurafsky and Martin
- Foundations of Statistical NLP, Manning and Schutze
Other references:
- Statistical NLP, Charniak
Journals: Computational Linguistics, Natural Language Engineering, AI, AI Magazine, IEEE SMC
Conferences: ACL, EACL, COLING, MT Summit, EMNLP, IJCNLP, HLT, ICON, SIGIR, WWW, ICML, ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied Disciplines

Philosophy: semantics, meaning of "meaning", logic (syllogism)
Linguistics: study of syntax, lexicon, lexical semantics, etc.
Probability and Statistics: corpus linguistics, testing of hypotheses, system evaluation
Cognitive Science: computational models of language processing, language acquisition
Psychology: behavioristic insights into language processing, psychological models
Brain Science: language processing areas in the brain
Physics: information theory, entropy, random fields
Computer Science & Engineering: systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day 1: Introduction; NLP as a playground for rule-based and statistical techniques
- Before break: complete NLP architecture, ambiguity, start of POS tagging
- After break: NLTK (open-source Python-based framework of comprehensive NLP tools), POS tagging assignment

Day 2: Shallow parsing
- Before break: morph analysis and synthesis (segmentation, inflection, declension, derivation, etc.); rule-based vs. statistical NLU comparison with POS tagging as a case study; Hidden Markov Model and Viterbi algorithm
- After break: POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summary
- Both linguistics and computation are needed.
- Linguistics is the eye, computation the body.
- The Phenomenon -> Formalization -> Technique -> Experimentation -> Evaluation -> Hypothesis Testing loop has accorded to NLP the prestige it commands today.
- A natural-science-like approach.
- Neither theory building nor data-driven pattern finding can be ignored.
21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

[Diagram: the NLP Trinity with three axes. Problem: Morph Analysis, Part of Speech Tagging, Parsing, Semantics. Algorithm: HMM, MEMM, CRF. Language: Hindi, Marathi, English, French.]

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

"_" The_DT mechanisms_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, "_" said_VBD Mr._NNP Benton_NNP ._.

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increasing complexity of processing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate the complexity of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_J batsman_N Aravinda_De_Silva_N is_F a_F man_N of_F few_J words_N was_F very_R much_R evident_J on_F Wednesday_N when_F the_F legendary_J batsman_N ,_F who_F has_V always_R let_V his_N bat_N talk_V ,_F struggled_V to_F answer_V a_F barrage_N of_F questions_N at_F a_F function_N to_F promote_V the_F cricket_N league_N in_F the_F city_N ._F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation:
- That_F/N/J ('that' can be a complementizer (put under 'F'), demonstrative (under 'J') or pronoun (under 'N'))
- former_J
- Sri_N/J Lanka_N/J (Sri Lanka together qualify the skipper)
- skipper_N/V ('skipper' can be a verb too)
- and_F
- ace_J/N ('ace' can be both J and N: "Nadal served an ace")
- batsman_N/J ('batsman' can be J as it qualifies Aravinda De Silva)
- Aravinda_N De_N Silva_N is_F a_F
- man_N/V ('man' can be a verb too, as in 'man the boat')
- of_F few_J
- words_N/V ('words' can be a verb too, as in 'he words his speeches beautifully')

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of "that":
- That man is known by the company he keeps. (demonstrative)
- Man that is known by the company he keeps gets a good job. (pronoun)
- That man is known by the company he keeps is a proverb. (complementizer)

Chaotic systems: systems where a small perturbation in the input causes a large change in the output.

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation (contd.): was_F very_R much_R evident_J on_F Wednesday_N
- when_F/N ('when' can be a relative pronoun (put under 'N'), as in 'I know the time when he comes')
- the_F legendary_J batsman_N who_F/N has_V always_R let_V his_N bat_N/V talk_V/N struggle_V/N answer_V/N barrage_N/V question_N/V function_N/V promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (1/2)

Best tag sequence = T* = argmax P(T|W) = argmax P(T) P(W|T)    (by Bayes' theorem)

P(T) = P(t0=^, t1, t2, ..., tn+1=.)
     = P(t0) P(t1|t0) P(t2|t1,t0) P(t3|t2,t1,t0) ... P(tn|tn-1,...,t0) P(tn+1|tn,...,t0)
     = P(t0) P(t1|t0) P(t2|t1) ... P(tn|tn-1) P(tn+1|tn)
     = Π_{i=0}^{n+1} P(ti | ti-1)    (bigram assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (2/2)

P(W|T) = P(w0|t0..tn+1) P(w1|w0,t0..tn+1) P(w2|w1,w0,t0..tn+1) ... P(wn|w0..wn-1,t0..tn+1) P(wn+1|w0..wn,t0..tn+1)

Assumption: a word is determined completely by its tag. This is inspired by speech recognition.

       = P(w0|t0) P(w1|t1) ... P(wn+1|tn+1)
       = Π_{i=0}^{n+1} P(wi | ti)
       = Π_{i=1}^{n+1} P(wi | ti)    (lexical probability assumption)
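Under these two assumptions, scoring a candidate tag sequence is a single product; a minimal sketch, with illustrative probability tables (the numbers are made up for the example):

```python
# Illustrative bigram (tag | previous tag) and lexical (word | tag) tables
P_tag = {('^', 'N'): 0.6, ('N', 'V'): 0.4, ('V', '.'): 0.3}
P_word = {('people', 'N'): 1e-3, ('laugh', 'V'): 1e-3, ('.', '.'): 1.0}

def score(words, tags):
    """P(T) * P(W|T) = prod P(t_i | t_{i-1}) * prod P(w_i | t_i)."""
    p = 1.0
    for i in range(1, len(tags)):
        p *= P_tag.get((tags[i - 1], tags[i]), 0.0)
    for w, t in zip(words, tags[1:]):        # tags[0] is the start marker ^
        p *= P_word.get((w, t), 0.0)
    return p

print(score(['people', 'laugh', '.'], ['^', 'N', 'V', '.']))   # 7.2e-08
```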

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R ._.

[Lattice figure: the tag sequence ^ -> N -> V -> A -> . generates the words; alternative tags (V for People, N for Jump, N for High) are also shown. Lexical probabilities connect tags to words; bigram probabilities connect adjacent tags.]

This model is called a generative model: words are observed (emitted) from tags as states. This is similar to an HMM.

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tagging steps:
- Implementation of Viterbi: unigram, bigram
- Five-fold evaluation
- Per-POS accuracy
- Confusion matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

[Bar chart: per-POS accuracy under the bigram assumption, one bar per BNC C5 tag (AJ0, AJ0-NN1, AJ0-VVG, AJC, AT0, AV0-AJ0, AVP-PRP, AVQ-CJS, CJS, CJS-PRP, CJT-DT0, CRD-PNI, DT0, DTQ, ITJ, NN1, NN1-NP0, NN1-VVG, NN2-VVZ, NP0-NN1, PNI, PNP, PNX, PRP, PRP-CJS, TO0, VBB, VBG, VBN, VDB, VDG, VDN, VHB, VHG, VHN, VM0, VVB-NN1, VVD-AJ0, VVG, VVG-NN1, VVN, VVN-VVD, VVZ-NN2), accuracy on the y-axis from 0 to 1.]

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

           AJ0   AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN  AJC  AJS   AT0   AV0   AV0-AJ0  AVP
AJ0       2899     20       32        1        3        3      0    0     18    35      27      1
AJ0-AV0     31     18        2        0        0        0      0    0      0     1      15      0
AJ0-NN1    161      0      116        0        0        0      0    0      0     0       1      0
AJ0-VVD      7      0        0        0        0        0      0    0      0     0       0      0
AJ0-VVG      8      0        0        0        2        0      0    0      1     0       0      0
AJ0-VVN      8      0        0        3        0        2      0    0      1     0       0      0
AJC          2      0        0        0        0        0     69    0      0    11       0      0
AJS          6      0        0        0        0        0      0   38      0     2       0      0
AT0        192      0        0        0        0        0      0    0   7000    13       0      0
AV0        120      8        2        0        0        0     15    2     24  2444      29     11
AV0-AJ0     10      7        0        0        0        0      0    0      0    16      33      0
AVP         24      0        0        0        0        0      0    0      1    11       0    737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

[The NLP Trinity diagram again, with HMM highlighted as the algorithm: Problem (Morph Analysis, Part of Speech Tagging, Parsing, Semantics) x Algorithm (HMM, MEMM, CRF) x Language (Hindi, Marathi, English, French).]

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example: colored ball choosing

Urn 1: Red 30%, Green 50%, Blue 20%
Urn 2: Red 10%, Green 40%, Blue 50%
Urn 3: Red 60%, Green 10%, Blue 30%

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd.)

Given the transition probability table:
        U1    U2    U3
U1     0.1   0.4   0.5
U2     0.6   0.2   0.2
U3     0.3   0.4   0.3

and the emission probability table:
        R     G     B
U1     0.3   0.5   0.2
U2     0.1   0.4   0.5
U3     0.6   0.1   0.3

Observation: R R G G B R G R
State sequence: not so easily computable.
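The two tables translate directly into the standard HMM parameter triple (A, B, π); below is a small numpy sketch, with a uniform start distribution assumed purely for illustration (the slides introduce π later but do not give its values):

```python
import numpy as np

states = ['U1', 'U2', 'U3']
symbols = ['R', 'G', 'B']

A = np.array([[0.1, 0.4, 0.5],      # transition probabilities P(next urn | urn)
              [0.6, 0.2, 0.2],
              [0.3, 0.4, 0.3]])
B = np.array([[0.3, 0.5, 0.2],      # emission probabilities P(colour | urn)
              [0.1, 0.4, 0.5],
              [0.6, 0.1, 0.3]])
pi = np.array([1/3, 1/3, 1/3])      # assumed uniform start distribution

rng = np.random.default_rng(0)

def sample(T):
    """Generate T observations from the urn HMM."""
    s = rng.choice(3, p=pi)
    obs = []
    for _ in range(T):
        obs.append(symbols[rng.choice(3, p=B[s])])
        s = rng.choice(3, p=A[s])
    return obs

print(sample(8))    # a colour sequence such as R R G G B R G R
```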

21 July 2014Pushpak Bhattacharyya Intro

POS 73


Diagrammatic representation (1/2)

[State-transition diagram of the three urns: arcs U1->U1 0.1, U1->U2 0.4, U1->U3 0.5, U2->U1 0.6, U2->U2 0.2, U2->U3 0.2, U3->U1 0.3, U3->U2 0.4, U3->U3 0.3; each state annotated with its emission probabilities (U1: R 0.3, G 0.5, B 0.2; U2: R 0.1, G 0.4, B 0.5; U3: R 0.6, G 0.1, B 0.3).]

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (2/2)

[The same diagram with each arc labeled by the product of transition and emission probabilities, e.g. the U1 self-loop carries R 0.03, G 0.05, B 0.02 (= 0.1 x the U1 emissions); the remaining arcs carry the analogous products such as (0.02, 0.08, 0.10), (0.24, 0.04, 0.12), (0.06, 0.24, 0.30), (0.15, 0.25, 0.10), (0.18, 0.03, 0.09).]

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM:
1. Given the observation sequence, find the possible state sequences: the Viterbi algorithm.
2. Given the observation sequence, find its probability: the forward/backward algorithm.
3. Given the observation sequence, find the HMM parameters: the Baum-Welch algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi

- The "start" and "end" are important in a sequence.
- Subtrees get eliminated due to the Markov assumption.

POS tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), . (end) [start & end states]

Lexicon:
people: N, V
laugh: N, V

Corpora for training:
^ w11_t11 w12_t12 w13_t13 ............ w1k_1_t1k_1 .
^ w21_t21 w22_t22 w23_t23 ............ w2k_2_t2k_2 .
^ wn1_tn1 wn2_tn2 wn3_tn3 ............ wnk_n_tnk_n .

Inference

[Partial sequence graph: ^ expands to N and to V; each of these expands again to N, V, ...]

Transition probability table:
        ^     N     V     O     .
^       0    0.6   0.2   0.2    0
N       0    0.1   0.4   0.3   0.2
V       0    0.3   0.1   0.3   0.3
O       0    0.3   0.2   0.3   0.2
.       1     0     0     0     0

This transition table will change from language to language due to language divergences.

Lexical probability table (size = #POS tags in the tagset x vocabulary size, where vocabulary size = #unique words in the corpus):
        Є       people      laugh      ...
^       1         0            0
N       0       1x10^-3      1x10^-5
V       0       1x10^-6      1x10^-3
O       0         0          1x10^-9
.       1         0            0

Inference on a new sentence: ^ people laugh .

p(^ N N . | ^ people laugh .) = (0.6 x 0.1) x (0.1 x 1 x 10^-3) x (0.2 x 1 x 10^-5)

[Partial sequence graph: ^ on Є to N, then N to N and N to V, ...]

Computational Complexity

If we had to compute the probability of each sequence and then find the maximum among them, we would run into an exponential number of computations.

If |s| = #states (tags + ^ + .) and |o| = length of the sentence (#words + ^ + .), then the number of sequences = |s|^(|o| - 2).

But a large number of partial computations can be reused using dynamic programming.

Dynamic Programming

[Tree over "Є people laugh": ^ expands to N1, V2, O3 on Є; each of these expands to N, V, O (N4, N5, ...) on 'people'; and so on.]

^ to N1 on Є: 0.6 x 1.0 = 0.6

Expanding N1 on 'people' (P(people|N) = 10^-3):
  to N: 0.6 x 0.1 x 10^-3 = 6 x 10^-5
  to V: 0.6 x 0.4 x 10^-3 = 2.4 x 10^-4
  to O: 0.6 x 0.3 x 10^-3 = 1.8 x 10^-4
  to .: 0.6 x 0.2 x 10^-3 = 1.2 x 10^-4

No need to expand N4 and N5, because they will never be part of the winning sequence.

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.

Now the complexity reduces from |s|^|o| to |s| x |o|.

Here we followed the Markov assumption of order 1.

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm

Tree for the sentence "^ People laugh .":

[Tree figure: from ^ (on Ԑ) the children N, V, O get scores (0.6), (0.2), (0.2); on 'People', the children of N get scores (0.06 x 10^-3), (0.24 x 10^-3), (0.18 x 10^-3), the children of V get (0.06 x 10^-6), (0.02 x 10^-6), (0.06 x 10^-6), and the children of O get (0), (0), (0).]

Claim: we do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass: will a word always be given the same tag? No. Consider the example:

^ people the city with soldiers . (i.e. 'populate')
^ quickly people the city .

In the first sentence "people" is most likely to be tagged as a noun, whereas in the second the probability mass will shift and "people" will be tagged as a verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

[Figure: two sibling nodes N1 and N2 with accumulated scores 6 x 10^-5 and 6 x 10^-8, each about to expand to N, V, O on LAUGH.]

In the next step all the probabilities will be multiplied by identical factors (lexical and transition probabilities), so the children of N2 will always have probability less than the children of N1.

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean?

P(A|B) = P(B|A) if P(A) = P(B).

P(A|B) can mean:
- Causality: B causes A
- Sequentiality: A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = {U1, U2, U3}, V = {R, G, B}.
For observation O = o1 ... on and state sequence Q = q1 ... qn:

π_i = P(q1 = Ui)

A = transition matrix:
        U1    U2    U3
U1     0.1   0.4   0.5
U2     0.6   0.2   0.2
U3     0.3   0.4   0.3

B = emission matrix:
        R     G     B
U1     0.3   0.5   0.2
U2     0.1   0.4   0.5
U3     0.6   0.1   0.3

92

Observations and states

       O1 O2 O3 O4 O5 O6 O7 O8
OBS:   R  R  G  G  B  R  G  R
State: S1 S2 S3 S4 S5 S6 S7 S8

Si = U1, U2, or U3 (a particular state); S = state sequence; O = observation sequence; S* = "best" possible state (urn) sequence.
Goal: maximize P(S|O) by choosing the "best" S.

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O), where S is the state sequence and O is the observation sequence:

S* = argmax_S P(S|O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

       O1 O2 O3 O4 O5 O6 O7 O8
OBS:   R  R  G  G  B  R  G  R
State: S1 S2 S3 S4 S5 S6 S7 S8

P(S|O) = P(S1-8 | O1-8)
       = P(S1|O) P(S2|S1,O) P(S3|S2,S1,O) ... P(S8|S7...S1,O)

By the Markov assumption (a state depends only on the previous state):

P(S|O) = P(S1|O) P(S2|S1,O) P(S3|S2,O) ... P(S8|S7,O)

POS 95

Bayes' Theorem

P(A|B) = P(A) P(B|A) / P(B)

P(A): prior; P(B|A): likelihood.

argmax_S P(S|O) = argmax_S P(S) P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transition Probability

P(S) = P(S1-8)
     = P(S1) P(S2|S1) P(S3|S2,S1) P(S4|S3,S2,S1) ... P(S8|S7...S1)

By the Markov assumption (k=1):

P(S) = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) ... P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence Probability

P(O|S) = P(O1|S1-8) P(O2|O1, S1-8) P(O3|O1-2, S1-8) ... P(O8|O1-7, S1-8)

Assumption: the ball drawn depends only on the urn chosen:

P(O|S) = P(O1|S1) P(O2|S2) P(O3|S3) ... P(O8|S8)

P(S|O) ∝ P(S) P(O|S)
       = P(S1) P(S2|S1) P(S3|S2) ... P(S8|S7) P(O1|S1) P(O2|S2) ... P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

       O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:   ε  R  R  G  G  B  R  G  R
State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

P(S) P(O|S) = [P(O0|S0) P(S1|S0)] [P(O1|S1) P(S2|S1)] [P(O2|S2) P(S3|S2)]
              [P(O3|S3) P(S4|S3)] [P(O4|S4) P(S5|S4)] [P(O5|S5) P(S6|S5)]
              [P(O6|S6) P(S7|S6)] [P(O7|S7) P(S8|S7)] [P(O8|S8) P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e. P(S9|S8) = 1.
O0 is the ε-transition.

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

[Figure: the chain S0 -> S1 -> S2 -> ... -> S8 -> S9, with arcs labeled by the observations ε, R, R, G, G, B, R, G, R.]

       O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:   ε  R  R  G  G  B  R  G  R
State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

P(Ok|Sk) P(Sk+1|Sk) = P(Sk --Ok--> Sk+1)

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

[Figure: two states S1 and S2, arcs labeled (symbol, probability): S1->S1 (a1, 0.1), (a2, 0.2); S1->S2 (a1, 0.3), (a2, 0.4); S2->S1 (a1, 0.2), (a2, 0.3); S2->S2 (a1, 0.3), (a2, 0.2).]

The question here is: "What is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree

Start (on ε): S1 = 1.0, S2 = 0.0

Reading a1:
  into S1: 1.0 x 0.1 = 0.1 (from S1); 0.0 (from S2)
  into S2: 1.0 x 0.3 = 0.3 (from S1); 0.0 (from S2)

Reading a2:
  into S1: 0.1 x 0.2 = 0.02 (from S1); 0.3 x 0.3 = 0.09 (from S2)
  into S2: 0.1 x 0.4 = 0.04 (from S1); 0.3 x 0.2 = 0.06 (from S2)

Choose the winning sequence per state, per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure (contd.)

Reading a1 (third symbol), from the winners S1 = 0.09 and S2 = 0.06:
  into S1: 0.09 x 0.1 = 0.009 (from S1); 0.06 x 0.2 = 0.012 (from S2)
  into S2: 0.09 x 0.3 = 0.027 (from S1); 0.06 x 0.3 = 0.018 (from S2)

Reading a2 (fourth symbol), from the winners S1 = 0.012 and S2 = 0.027:
  into S1: 0.012 x 0.2 = 0.0024 (from S1); 0.027 x 0.3 = 0.0081 (from S2)
  into S2: 0.012 x 0.4 = 0.0048 (from S1); 0.027 x 0.2 = 0.0054 (from S2)

The problem being addressed by this tree is
S* = argmax_S { P(S | a1-a2-a1-a2, μ) }
where a1-a2-a1-a2 is the output sequence and μ the model (the machine).

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 -> S2 -> S1 -> S2 -> S1, on outputs a1, a2, a1, a2.

Problem statement: find the best possible sequence

S* = argmax_S P(S | O, μ)

where S = state sequence, O = output sequence, μ = machine (model).

The machine (model) is μ = (S, A, T, S0): S = state collection, A = alphabet set, T = transitions, S0 = start state/symbol.
T is defined as P(Si --ak--> Sj).

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

Ending state \ latest symbol observed:
        ε      a1                               a2             a1                a2
S1     1.0    (1.0x0.1, 0.0x0.2) = (0.1, 0.0)   (0.02, 0.09)   (0.009, 0.012)    (0.0024, 0.0081)
S2     0.0    (1.0x0.3, 0.0x0.3) = (0.3, 0.0)   (0.04, 0.06)   (0.027, 0.018)    (0.0048, 0.0054)

Note: every cell records the winning probability ending in that state; within a cell the pair is (coming from S1, coming from S2), and the larger value is the sequence probability ending in that state. The final winner is 0.0081, ending in state S1 and reached via S2 (indicated by the 2nd element of the tuple); going backward from it we recover the sequence.
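The table can be reproduced by running Viterbi directly on the probabilistic FSM; a short sketch with the arc probabilities read off the figure and the start distribution S1 = 1.0, S2 = 0.0:

```python
# arc[(i, sym, j)] = P(Si --sym--> Sj), read off the probabilistic FSM
arc = {('S1', 'a1', 'S1'): 0.1, ('S1', 'a1', 'S2'): 0.3,
       ('S1', 'a2', 'S1'): 0.2, ('S1', 'a2', 'S2'): 0.4,
       ('S2', 'a1', 'S1'): 0.2, ('S2', 'a1', 'S2'): 0.3,
       ('S2', 'a2', 'S1'): 0.3, ('S2', 'a2', 'S2'): 0.2}
states = ['S1', 'S2']
outputs = ['a1', 'a2', 'a1', 'a2']

score = {'S1': 1.0, 'S2': 0.0}
backptr = []
for sym in outputs:
    new_score, bp = {}, {}
    for j in states:
        cands = {i: score[i] * arc[(i, sym, j)] for i in states}
        best = max(cands, key=cands.get)
        new_score[j], bp[j] = cands[best], best
    backptr.append(bp)
    score = new_score

# sequence identification: best final state, then follow back-pointers
last = max(score, key=score.get)
path = [last]
for bp in reversed(backptr):
    path.append(bp[path[-1]])
path.reverse()
print(path, score[last])    # ['S1', 'S2', 'S1', 'S2', 'S1'] 0.0081
```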

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm
(following James Allen, Natural Language Understanding (2nd edition), Benjamin Cummings (pub.), 1995)

Given:
1. The HMM, which means:
   a. Start state S1
   b. Alphabet A = {a1, a2, ..., ap}
   c. Set of states S = {S1, S2, ..., Sn}
   d. Transition probability P(Si --ak--> Sj), which is equal to P(Sj, ak | Si)
2. The output string a1 a2 ... aT

To find: the most likely sequence of states C1 C2 ... CT which produces the given output sequence, i.e.

C1 C2 ... CT = argmax_C [ P(C | a1 a2 ... aT) ]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Garden pathing

The only minus possibly was the need to face the audience more and more insightful question answer

The old man the boat

The horse raced past the garden fell

21 July 2014Pushpak Bhattacharyya Intro

POS 27

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 28

Semantic Analysis Representation in terms of

Predicate calculusSemantic NetsFramesConceptual Dependencies and Scripts

John gave a book to Mary Give action Agent John Object Book Recipient

Mary Challenge ambiguity in semantic role labeling

(Eng) Visiting aunts can be a nuisance (Hin) aapko mujhe mithaai khilaanii padegii

(ambiguous in Marathi and Bengali too not in Dravidian languages)

21 July 2014Pushpak Bhattacharyya Intro

POS 29

Coreference challenge

Binding of co-referring nouns and pronouns The monkey ate the banana because it

was hungry The monkey ate the banana because it

was ripe and sweet The monkey ate the banana because it

was lunch time

21 July 2014Pushpak Bhattacharyya Intro

POS 30

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 31

Pragmatics Very hard problem Model user intention

Tourist (in a hurry checking out of the hotel motioning to the service boy) Boy go upstairs and see if my sandals are under the divan Do not be late I just have 15 minutes to catch the train

Boy (running upstairs and coming back panting) yes sir they are there

World knowledge WHY INDIA NEEDS A SECOND OCTOBER (ToI

21007)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

DiscourseProcessing of sequence of sentences Mother to John

John go to school It is open today Should you bunk Father will be very angry

Ambiguity of openbunk whatWhy will the father be angry

Complex chain of reasoning and application of world knowledge Ambiguity of father

father as parent or

father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected ndash today was the math test

He couldnrsquot control the class

Teacher shouldnrsquot have made him responsible

After all he is just a janitor

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (12)1 Teacher (angrily) did you miss the class yesterday

Student not much

2 A man coming back to his parked car sees the sticker Parking fine He goes and thanks the policeman for appreciating his parking skill

3 John I got a Jaguar car for my unemployed youngest sonJack Thats a great exchange

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (22)A teacher-student exchangeTeacher What do you think is the capital of Ethiopia

Student What do you think

Teacher (angrily) I do not think ltpausegt I know

Student I do not think I know ltno pausegt21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps Segmentation (Word Boundary Detection) Identify the word

Isolated Word Recognition Identify W given SS (speech signal)

^arg max ( | )

WW P W SS

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood called ldquophonological model ldquo intuitively more tractable

P(W) = prior probability called ldquolanguage modelrdquo

^arg m ax ( | )

arg m ax ( ) ( | )W

W

W P W SS

P W P SS W

W appears in the corpus( ) words in the corpus

P W

21 July 2014Pushpak Bhattacharyya Intro

POS 39

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound Text ambiguity whether vs weather right vs write bought vs bot

Text Sound ambiguity read (present tense) vs read (past tense) lead (verb) vs lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W)= P(t o m ae t o |Word is ldquotomatordquo) = Product of arc probabilities

t

s4

o m o

ae

t

aa

end

s1 s2 s3s5

s6 s7

10 10 10 1010

10

073

027

Word

Pronunciation Automaton

Tomato

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs Discrimnative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

bull Entity A and Entity BndashMatch(AB)

ndashTwo entities match iff their parts matchbull Match(Parts(A) Parts(B))

ndashTwo entities match iff their properties matchbull Match(Properties(A) Properties(B))

bull Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding James Allan Speech and NLP Jurafsky and Martin Foundations of Statistical NLP Manning and Schutze

Other References Statistical NLP Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied DisciplinesPhilosophy Semantics Meaning of ldquomeaningrdquo Logic

(syllogism)Linguistics Study of Syntax Lexicon Lexical Semantics etc

Probability and Statistics Corpus Linguistics Testing of Hypotheses System Evaluation

Cognitive Science Computational Models of Language Processing Language Acquisition

Psychology Behavioristic insights into Language Processing Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory Entropy Random Fields

Computer Sc amp Engg Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

infection declension derivation etc ) Rule based Vs Statistical NLU comparison with POS tagging as case study Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

ldquo_ldquo The_DT mechanisms_NNS that_WDTmake_VBP traditional_JJ hardware_NNare_VBP really_RB being_VBGobsoleted_VBN by_IN microprocessor-based_JJ machines_NNS _ rdquo_rdquo said_VBDMr_NNP Benton_NNP _

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

P(S|O) = P(S1-8 | O1-8)
       = P(S1|O) . P(S2|S1,O) . P(S3|S1,S2,O) . ... . P(S8|S1..S7,O)

By Markov Assumption (a state depends only on the previous state):

P(S|O) = P(S1|O) . P(S2|S1,O) . P(S3|S2,O) . ... . P(S8|S7,O)

OBS:   R  R  G  G  B  R  G  R
State: S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A,B) = P(A) . P(B|A) = P(B) . P(A|B)

P(A): Prior
P(B|A): Likelihood

argmax_S P(S|O) = argmax_S P(S) . P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

P(S) = P(S1-8)
     = P(S1) . P(S2|S1) . P(S3|S1,S2) . P(S4|S1,S2,S3) . ... . P(S8|S1..S7)

By Markov Assumption (k=1):

P(S) = P(S1) . P(S2|S1) . P(S3|S2) . P(S4|S3) . ... . P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

P(O|S) = P(O1|S1-8) . P(O2|O1,S1-8) . P(O3|O1,O2,S1-8) . ... . P(O8|O1..O7,S1-8)

Assumption: the ball drawn depends only on the Urn chosen:

P(O|S) = P(O1|S1) . P(O2|S2) . P(O3|S3) . ... . P(O8|S8)

So
P(S|O) ∝ P(S) . P(O|S)
       = P(S1) . P(S2|S1) . P(S3|S2) . ... . P(S8|S7) . P(O1|S1) . P(O2|S2) . ... . P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S) . P(O|S) = [P(O0|S0) . P(S1|S0)] . [P(O1|S1) . P(S2|S1)] . [P(O2|S2) . P(S3|S2)] . [P(O3|S3) . P(S4|S3)] . [P(O4|S4) . P(S5|S4)] . [P(O5|S5) . P(S6|S5)] . [P(O6|S6) . P(S7|S6)] . [P(O7|S7) . P(S8|S7)] . [P(O8|S8) . P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e. P(S9|S8) = 1.
O0 is the ε-transition.

Obs:   ε  R  R  G  G  B  R  G  R
State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

[Figure: the states S0, S1, ..., S9 drawn as a chain; each arc Sk → Sk+1 is labelled with the observation (ε, R, R, G, G, B, R, G, R) emitted on that transition.]

Obs:   ε  R  R  G  G  B  R  G  R
State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

Notation: P(Ok|Sk) . P(Sk+1|Sk) = P(Sk → Sk+1) with Ok on the arc

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

[Figure: a two-state probabilistic FSM over S1 and S2; each arc carries an (output symbol : probability) label: (a1:0.3), (a2:0.4), (a1:0.2), (a2:0.3), (a1:0.1), (a2:0.2), (a1:0.3), (a2:0.2).]

The question here is: "what is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree

Start (ε): S1 gets 1.0, S2 gets 0.0.
Reading a1: the arcs carry probabilities 0.1, 0.3, 0.2, 0.3; the resulting values are 1.0 x 0.1 = 0.1 and 0.3 for the branches from S1, and 0.0, 0.0 for the branches from S2.
Reading a2 (arc probabilities 0.2, 0.4, 0.3, 0.2): 0.1 x 0.2 = 0.02, 0.1 x 0.4 = 0.04, 0.3 x 0.3 = 0.09, 0.3 x 0.2 = 0.06.

Choose the winning sequence per state, per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd...

[Figure: the tree extended for the third and fourth symbols. Starting from the stage winners 0.09 and 0.06, the values after the next a1 (arc probabilities 0.1, 0.3, 0.2, 0.3) are 0.09 x 0.1 = 0.009, 0.012, 0.027, 0.018; the leaf values after the final a2 are S1: 0.3 → 0.0081, S2: 0.2 → 0.0054, S2: 0.4 → 0.0048, S1: 0.2 → 0.0024.]

The problem being addressed by this tree is S* = argmax_S P(S | a1-a2-a1-a2, µ),
where a1-a2-a1-a2 is the output sequence and µ the model, or the machine.

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward): the winning path is S1 S2 S1 S2 S1 for the output sequence a1-a2-a1-a2.

Problem statement: find the best possible sequence
S* = argmax_S P(S | O, µ)
where S = state sequence, O = output sequence, µ = machine (or model).

µ = (S, S0, A, T): state collection, start symbol, alphabet set, transitions.

T is defined as P(Si --ak--> Sj)  for all i, j, k

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

Ending state   ε      a1                                a2             a1               a2
S1            1.0    (1.0 x 0.1, 0.0 x 0.2) = (0.1, 0.0)   (0.02, 0.09)   (0.009, 0.012)   (0.0024, 0.0081)
S2            0.0    (1.0 x 0.3, 0.0 x 0.3) = (0.3, 0.0)   (0.04, 0.06)   (0.027, 0.018)   (0.0048, 0.0054)

Rows: ending state; columns: latest symbol observed.
Note: every cell records the winning probability ending in that state.

The bold-faced values in each cell show the sequence probability ending in that state. Going backward from the final winner sequence, which ends in state S2 (indicated by the 2nd tuple), we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm (following James Allen, Natural Language Understanding (2nd edition), Benjamin/Cummings (pub.), 1995)

Given:
1. The HMM, which means:
   a. Start state: S1
   b. Alphabet: A = {a1, a2, ..., ap}
   c. Set of states: S = {S1, S2, ..., Sn}
   d. Transition probability P(Si --ak--> Sj), which is equal to P(Sj | Si, ak)
2. The output string a1 a2 ... aT

To find: the most likely sequence of states C1 C2 ... CT which produces the given output sequence, i.e.,
C1 C2 ... CT = argmax_C [P(C | a1 a2 ... aT)]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contd...

Data structures:
1. An N x T array called SEQSCORE to maintain the winner sequence always (N = #states, T = length of output sequence)
2. Another N x T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For i = 2 to N do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For t = 2 to T do
    For i = 1 to N do
        SEQSCORE(i,t) = Max over j=1..N of [SEQSCORE(j, t-1) x P(Sj --ak--> Si)]
        BACKPTR(i,t) = the index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = the i that maximizes SEQSCORE(i,T)
For i from (T-1) down to 1 do
    C(i) = BACKPTR[C(i+1), i+1]

Optimizations possible:
1. BACKPTR can be 1 x T
2. SEQSCORE can be T x 2

Homework: compare this with A* and Beam Search. Reason for this comparison: both of them work for finding and recovering a sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 109
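Below is a minimal Python sketch (not part of the slides) of the SEQSCORE/BACKPTR procedure just described, run on the simplified ^/N/V/O/. tagset; the transition and lexical numbers are the toy values that appear on the "Illustration of Viterbi" slides later in the deck.

import numpy as np

states = ["^", "N", "V", "O", "."]
# transition[i][j] = P(states[j] | states[i])
transition = np.array([
    [0.0, 0.6, 0.2, 0.2, 0.0],   # from ^
    [0.0, 0.1, 0.4, 0.3, 0.2],   # from N
    [0.0, 0.3, 0.1, 0.3, 0.3],   # from V
    [0.0, 0.3, 0.2, 0.3, 0.2],   # from O
    [1.0, 0.0, 0.0, 0.0, 0.0],   # from .
])
lexical = {"^": {"<eps>": 1.0}, "N": {"people": 1e-3, "laugh": 1e-5},
           "V": {"people": 1e-6, "laugh": 1e-3}, "O": {"laugh": 1e-9},
           ".": {"<eps>": 1.0}}

def viterbi(words):
    N, T = len(states), len(words)
    seqscore = np.zeros((N, T))
    backptr = np.zeros((N, T), dtype=int)
    # Initialization: before the first word we are in the start state ^
    for i, s in enumerate(states):
        seqscore[i, 0] = transition[0, i] * lexical[s].get(words[0], 0.0)
    # Iteration
    for t in range(1, T):
        for i, s in enumerate(states):
            cand = [seqscore[j, t - 1] * transition[j, i] for j in range(N)]
            j_best = int(np.argmax(cand))
            seqscore[i, t] = cand[j_best] * lexical[s].get(words[t], 0.0)
            backptr[i, t] = j_best
    # Sequence identification: follow back-pointers from the best final state
    best = [int(np.argmax(seqscore[:, T - 1]))]
    for t in range(T - 1, 0, -1):
        best.append(int(backptr[best[-1], t]))
    return [states[i] for i in reversed(best)]

print(viterbi(["people", "laugh"]))   # -> ['N', 'V'] with these tables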

Viterbi Algorithm for the Urn problem (first two symbols)

[Figure: the first two levels of the Viterbi tree for the urn problem. From the start state S0, ε-arcs (probabilities 0.5, 0.3, 0.2 in the figure) lead to U1, U2, U3; the values shown after reading R are 0.03, 0.08, 0.15 at the first level, with 0.06, 0.02, 0.02, 0.18, 0.24, 0.18 on the next arcs and 0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036 at the leaves.]

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of order > 1 (say 2)

Same theory works:
P(S) . P(O|S) = [P(O0|S0) . P(S1|S0)] . [P(O1|S1) . P(S2|S1,S0)] . [P(O2|S2) . P(S3|S2,S1)] . [P(O3|S3) . P(S4|S3,S2)] . [P(O4|S4) . P(S5|S4,S3)] . [P(O5|S5) . P(S6|S5,S4)] . [P(O6|S6) . P(S7|S6,S5)] . [P(O7|S7) . P(S8|S7,S6)] . [P(O8|S8) . P(S9|S8,S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence? The language modeling problem.

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   - Less probable because of grammatical mistake
3. P("The svn rises in the east")
   - Less probable because of lexical mistake
4. P("The sun rises in the west")
   - Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of a language model:
1. Detect well-formedness
   - Lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   - Given a piece of text, what language does it belong to?
     Good morning -> English
     Guten Morgen -> German
     Bon jour -> French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0 o1 o2 o3 ... om)?

P(O) = Σ_S P(O, S)    [Marginalization]

Consider the observation sequence O = O0 O1 O2 ... Om and the state sequences S = S0 S1 S2 S3 ... Sm Sm+1, where the Si's represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0 o1 o2 o3 ... om)

P(O, S) = P(S) . P(O|S)

P(S0 S1 ... Sm+1) . P(O0 O1 ... Om | S0 S1 ... Sm+1)
  = [P(S0) . P(S1|S0) . P(S2|S1) . ... . P(Sm+1|Sm)] . [P(O0|S0) . P(O1|S1) . ... . P(Om|Sm)]
  = [P(O0|S0) . P(S1|S0)] . [P(O1|S1) . P(S2|S1)] . ... . [P(Om|Sm) . P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k,i)

Define F(k,i) = probability of being in state Si having seen o0 o1 o2 ... ok:
F(k,i) = P(o0 o1 o2 ... ok, Si)

With m as the length of the observed sequence and N states,
P(observed sequence) = P(o0 o1 o2 ... om)
                     = Σ_{p=0..N} P(o0 o1 o2 ... om, Sp)
                     = Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)

F(k, q) = P(o0 o1 o2 ... ok, Sq)
        = P(o0 o1 o2 ... ok-1, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 ... ok-1, Sp, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 ... ok-1, Sp) . P(ok, Sq | o0 o1 o2 ... ok-1, Sp)
        = Σ_{p=0..N} F(k-1, p) . P(ok, Sq | Sp)
        = Σ_{p=0..N} F(k-1, p) . P(Sp → Sq)_ok

Obs:   O0 O1 O2 O3 ... Ok Ok+1 ... Om-1 Om
State: S0 S1 S2 S3 ... Sp Sq  ... Sm  Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119
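A short sketch (not from the slides) of this recursion, in the deck's convention where the source state of each arc emits the observation; 'trans' and 'emit' are whatever tables the HMM uses.

def forward(observations, states, start, trans, emit):
    # F[q] after processing o0..ok equals F(k, q) = P(o0 .. ok, q)
    F = {s: (1.0 if s == start else 0.0) for s in states}
    for o in observations:
        F_new = {q: 0.0 for q in states}
        for p in states:
            for q in states:
                F_new[q] += F[p] * emit.get(p, {}).get(o, 0.0) * trans.get(p, {}).get(q, 0.0)
        F = F_new
    return F    # P(O) = sum of the F values (or F[end] if an end state is required)

With the ^/N/V/. tables used on the following slides, forward(["<eps>", "People", "Laugh"], ...)["."] comes out to about 0.0668, the same value as the sum of the four path probabilities computed on the "Computation in various paths" slide.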

Backward probability B(k,i)

Define B(k,i) = probability of seeing ok ok+1 ok+2 ... om given that the state was Si:
B(k,i) = P(ok ok+1 ok+2 ... om | Si)

With m as the length of the whole observed sequence,
P(observed sequence) = P(o0 o1 o2 ... om) = P(o0 o1 o2 ... om | S0) = B(0,0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)

B(k, p) = P(ok ok+1 ok+2 ... om | Sp)
        = P(ok+1 ok+2 ... om, ok | Sp)
        = Σ_{q=0..N} P(ok+1 ok+2 ... om, ok, Sq | Sp)
        = Σ_{q=0..N} P(ok, Sq | Sp) . P(ok+1 ok+2 ... om | ok, Sq, Sp)
        = Σ_{q=0..N} P(ok+1 ok+2 ... om | Sq) . P(ok, Sq | Sp)
        = Σ_{q=0..N} B(k+1, q) . P(Sp → Sq)_ok

Obs:   O0 O1 O2 O3 ... Ok Ok+1 ... Om-1 Om
State: S0 S1 S2 S3 ... Sp Sq  ... Sm  Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121
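A matching sketch (not from the slides; the exercise slide later asks for the full derivation and complexity) of the backward recursion:

def backward(observations, states, trans, emit):
    # B[p] after processing ok..om equals B(k, p) = P(ok .. om | p)
    B = {s: 1.0 for s in states}                 # B(m+1, q) = 1 for every q
    for o in reversed(observations):
        B_new = {p: 0.0 for p in states}
        for p in states:
            for q in states:
                B_new[p] += emit.get(p, {}).get(o, 0.0) * trans.get(p, {}).get(q, 0.0) * B[q]
        B = B_new
    return B    # P(O) = B[start]

As a consistency check, B[start] from this sketch equals the sum of the F values returned by the forward sketch above, since both marginalize freely over the final state.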

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition:
       ^    N    V    .
^      0   0.7  0.3   0
N      0   0.2  0.6  0.2
V      0   0.6  0.2  0.2
.      1    0    0    0

Lexical:
       ε   People  Laugh
^      1     0       0
N      0    0.8     0.2
V      0    0.1     0.9
.      1     0       0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all state sequences S of Π_j P(Oj | Sj) . P(Sj+1 | Sj)

Computation in the various paths of the tree (on ε, People, Laugh):

Path 1: ^ N N
P(Path1) = (1.0 x 0.7) x (0.8 x 0.2) x (0.2 x 0.2)

Path 2: ^ N V
P(Path2) = (1.0 x 0.7) x (0.8 x 0.6) x (0.9 x 0.2)

Path 3: ^ V N
P(Path3) = (1.0 x 0.3) x (0.1 x 0.6) x (0.2 x 0.2)

Path 4: ^ V V
P(Path4) = (1.0 x 0.3) x (0.1 x 0.2) x (0.9 x 0.2)

[Figure: the four paths drawn as a tree from ^ through the N/V choices for "People" and "Laugh".]

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis

F = accumulated F x output probability x transition probability

F(N1) = 0.7 x 1.0
F(V1) = 0.3 x 1.0
F(N2) = F(N1) x (0.2 x 0.8) + F(V1) x (0.6 x 0.1)
F(V2) = F(N1) x (0.6 x 0.8) + F(V1) x (0.2 x 0.1)
F(.)  = F(N2) x (0.2 x 0.2) + F(V2) x (0.2 x 0.9)

[Trellis figure: ^ followed by the N1/V1 column (People), the N2/V2 column (Laugh) and the end state, with ε on the first arcs.]

21 July 2014Pushpak Bhattacharyya Intro

POS 125
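A quick numeric check (a sketch, not from the slides) of the trellis values above, using the transition and lexical tables given two slides earlier:

F_N1 = 1.0 * 0.7
F_V1 = 1.0 * 0.3
F_N2 = F_N1 * (0.2 * 0.8) + F_V1 * (0.6 * 0.1)    # reach N in the "Laugh" column
F_V2 = F_N1 * (0.6 * 0.8) + F_V1 * (0.2 * 0.1)    # reach V in the "Laugh" column
F_end = F_N2 * (0.2 * 0.2) + F_V2 * (0.2 * 0.9)   # reach '.'
print(F_N1, F_V1, F_N2, F_V2, F_end)              # F_end = 0.06676
# F_end equals the sum of the four path probabilities from the tree:
paths = [(1.0*0.7)*(0.8*0.2)*(0.2*0.2), (1.0*0.7)*(0.8*0.6)*(0.9*0.2),
         (1.0*0.3)*(0.1*0.6)*(0.2*0.2), (1.0*0.3)*(0.1*0.2)*(0.9*0.2)]
print(sum(paths))                                 # also 0.06676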

Number of Multiplications

Tree: each path needs 5 multiplications; there are 4 paths in the tree, and summing the path scores needs 3 additions. Therefore a total of 20 multiplications and 3 additions.

Trellis: F(N1) → 1 multiplication; F(V1) → 1 multiplication; F(N2) = F(N1) x (1 mult) + F(V1) x (1 mult) = 4 multiplications + 1 addition. Similarly for F(V2) and F(.): 4 multiplications and 1 addition each.

So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = #states and |O| = observation length (excluding '^' and '.').
- Stage 1 of the trellis: |S| multiplications.
- Stage 2 of the trellis: |S| nodes; each node needs computation over |S| arcs; each arc = 1 multiplication; accumulating F = 1 more multiplication; total 2|S|^2 multiplications.
- The same holds for each stage before reading '.'.
- At the final stage ('.'): 2|S| multiplications.

Therefore, total multiplications = |S| + 2|S|^2 (|O| - 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values multiplied by P(Sp → Sq).
3. Complexity = |S| + 2|S|^2 (|O| - 1) + 2|S|
              = 2|S|^2 |O| - 2|S|^2 + 3|S|
              = O(|S|^2 |O|)

i.e., linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward Probability
   a) Derive the Backward Algorithm
   b) Compute its complexity
2. Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increased Complexity Of Processing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 28

Semantic Analysis: representation in terms of
Predicate calculus, Semantic Nets, Frames, Conceptual Dependencies and Scripts.

John gave a book to Mary: Give (action); Agent: John; Object: Book; Recipient: Mary.

Challenge: ambiguity in semantic role labeling
(Eng) Visiting aunts can be a nuisance
(Hin) aapko mujhe mithaai khilaanii padegii (roughly: "you will have to treat me to sweets" vs. "I will have to treat you to sweets"; who feeds whom is ambiguous)
(ambiguous in Marathi and Bengali too, not in Dravidian languages)

21 July 2014Pushpak Bhattacharyya Intro

POS 29

Coreference challenge

Binding of co-referring nouns and pronouns:
- The monkey ate the banana because it was hungry.
- The monkey ate the banana because it was ripe and sweet.
- The monkey ate the banana because it was lunch time.

21 July 2014Pushpak Bhattacharyya Intro

POS 30

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increased Complexity Of Processing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 31

Pragmatics Very hard problem Model user intention

Tourist (in a hurry checking out of the hotel motioning to the service boy) Boy go upstairs and see if my sandals are under the divan Do not be late I just have 15 minutes to catch the train

Boy (running upstairs and coming back panting) yes sir they are there

World knowledge: WHY INDIA NEEDS A SECOND OCTOBER (ToI, 2.10.07)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increased Complexity Of Processing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

Discourse: processing of a sequence of sentences. Mother to John:

"John, go to school. It is open today. Should you bunk? Father will be very angry."

Ambiguity of 'open', 'bunk', 'what'? Why will the father be angry?

Complex chain of reasoning and application of world knowledge. Ambiguity of father:

father as parent or

father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected - today was the math test.

He couldn't control the class.

Teacher shouldn't have made him responsible.

After all he is just a janitor

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (1/2)

1. Teacher (angrily): did you miss the class yesterday?
   Student: not much.

2. A man coming back to his parked car sees the sticker "Parking fine". He goes and thanks the policeman for appreciating his parking skill.

3. John: I got a Jaguar car for my unemployed youngest son. Jack: That's a great exchange!

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (2/2)

A teacher-student exchange:
Teacher: What do you think is the capital of Ethiopia?
Student: What do you think?
Teacher (angrily): I do not think <pause> I know.
Student: I do not think I know. <no pause>
21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps Segmentation (Word Boundary Detection) Identify the word

Isolated Word Recognition: identify W given SS (speech signal)

W* = argmax_W P(W | SS)

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood, called the "phonological model"; intuitively more tractable
P(W) = prior probability, called the "language model"

W* = argmax_W P(W | SS) = argmax_W P(W) . P(SS | W)

P(W) = #(W appears in the corpus) / #(words in the corpus)

21 July 2014Pushpak Bhattacharyya Intro

POS 39
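A tiny sketch (not part of the slides) of this corpus-count estimate of the language-model prior P(W); the toy corpus is made up for illustration.

from collections import Counter

corpus = "the sun rises in the east and the sun sets in the west".split()
counts = Counter(corpus)
total = len(corpus)

def p_word(w):
    # P(W) = #(W appears in the corpus) / #(words in the corpus)
    return counts[w] / total

print(p_word("the"), p_word("sun"))   # 4/13 and 2/13 for this toy corpus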

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound → Text ambiguity: whether vs. weather, right vs. write, bought vs. bot
Text → Sound ambiguity: read (present tense) vs. read (past tense), lead (verb) vs. lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb), Advice (noun)
ad-vise, ad-vice

A syllable consists of: 1. Nucleus 2. Onset 3. Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W) = P(t o m ae t o | Word is "tomato") = product of arc probabilities

[Figure: pronunciation automaton for the word "Tomato": states s1 ... s7 plus an end state, with arcs t → o → m → (ae with probability 0.73, or aa with probability 0.27) → t → o; all other arc probabilities are 1.0.]

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs. Discriminative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

- Entity A and Entity B: Match(A, B)
- Two entities match iff their parts match: Match(Parts(A), Parts(B))
- Two entities match iff their properties match: Match(Properties(A), Properties(B))
- Heart of discriminative vs. generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding: James Allen; Speech and NLP: Jurafsky and Martin; Foundations of Statistical NLP: Manning and Schutze

Other References: Statistical NLP: Charniak

Journals: Computational Linguistics, Natural Language Engineering, AI, AI Magazine, IEEE SMC

Conferences: ACL, EACL, COLING, MT Summit, EMNLP, IJCNLP, HLT, ICON, SIGIR, WWW, ICML, ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied Disciplines

Philosophy: Semantics, meaning of "meaning", Logic (syllogism)
Linguistics: Study of Syntax, Lexicon, Lexical Semantics etc.
Probability and Statistics: Corpus Linguistics, Testing of Hypotheses, System Evaluation
Cognitive Science: Computational Models of Language Processing, Language Acquisition
Psychology: Behavioristic insights into Language Processing, Psychological Models
Brain Science: Language Processing Areas in the Brain
Physics: Information Theory, Entropy, Random Fields
Computer Sc. & Engg.: Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

inflection, declension, derivation etc.), Rule-based vs. Statistical NLU comparison with POS tagging as case study, Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summary
- Both Linguistics and Computation are needed.
- Linguistics is the eye, Computation the body.
- Phenomenon → Formalization → Technique → Experimentation → Evaluation → Hypothesis Testing
- has accorded to NLP the prestige it commands today.
- Natural Science like approach.
- Neither Theory Building nor Data Driven Pattern finding can be ignored.
21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

"_" The_DT mechanisms_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, "_" said_VBD Mr._NNP Benton_NNP ._.

21 July 2014Pushpak Bhattacharyya Intro

POS 57
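For readers following along with NLTK (the toolkit used in the hands-on sessions), a quick way to reproduce this kind of Penn Treebank tagging is sketched below; it assumes the 'punkt' and 'averaged_perceptron_tagger' resources have been downloaded.

import nltk

sentence = ("The mechanisms that make traditional hardware are really "
            "being obsoleted by microprocessor-based machines.")
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))    # list of (word, Penn Treebank tag) pairs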

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increased Complexity Of Processing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate the complexity of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_J batsman_N Aravinda_De_Silva_N is_F a_F man_N of_F few_J words_N was_F very_R much_R evident_J on_F Wednesday_N when_F the_F legendary_J batsman_N ,_F who_F has_V always_R let_V his_N bat_N talk_V ,_F struggled_V to_F answer_V a_F barrage_N of_F questions_N at_F a_F function_N to_F promote_V the_F cricket_N league_N in_F the_F city_N ._F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation:
That_F/N/J ('that' can be a complementizer (can be put under 'F'), demonstrative (can be put under 'J') or pronoun (can be put under 'N'))
former_J
Sri_N/J Lanka_N/J (Sri and Lanka together qualify the skipper)
skipper_N/V ('skipper' can be a verb too)
and_F
ace_J/N ('ace' can be both J and N: "Nadal served an ace")
batsman_N/J ('batsman' can be J as it qualifies Aravinda De Silva)
Aravinda_N De_N Silva_N is_F a_F
man_N/V ('man' can be a verb too, as in 'man the boat')
of_F few_J
words_N/V ('words' can be a verb too, as in 'he words his speeches beautifully')

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation: was_F very_R much_R evident_J on_F Wednesday_N
when_F/N ('when' can be a relative pronoun (put under 'N') as in 'I know the time when he comes')
the_F legendary_J batsman_N who_F/N has_V always_R let_V his_N bat_N/V talk_V/N struggle_V/N answer_V/N barrage_N/V question_N/V function_N/V promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (1/2)

Best tag sequence = T* = argmax P(T|W) = argmax P(T) . P(W|T)   (by Bayes' Theorem)

P(T) = P(t0=^, t1, t2, ..., tn+1=.)
     = P(t0) . P(t1|t0) . P(t2|t1,t0) . P(t3|t2,t1,t0) ... P(tn|tn-1,...,t0) . P(tn+1|tn,...,t0)
     = P(t0) . P(t1|t0) . P(t2|t1) ... P(tn|tn-1) . P(tn+1|tn)
     = Π_{i=0..n+1} P(ti|ti-1)    (Bigram Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0..tn+1) . P(w1|w0,t0..tn+1) . P(w2|w1,w0,t0..tn+1) ... P(wn|w0..wn-1,t0..tn+1) . P(wn+1|w0..wn,t0..tn+1)

Assumption: a word is determined completely by its tag. This is inspired by speech recognition.

       = P(w0|t0) . P(w1|t1) ... P(wn+1|tn+1)
       = Π_{i=0..n+1} P(wi|ti)
       = Π_{i=1..n+1} P(wi|ti)    (Lexical Probability Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 66
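A small sketch (not from the slides) of scoring one candidate tag sequence with exactly this factorization, Π P(ti|ti-1) . P(wi|ti); the numbers are the toy transition/lexical values that appear on the Viterbi illustration slides later in the deck.

def score(words, tags, trans, lex):
    p, prev = 1.0, "^"
    for w, t in zip(words, tags):
        p *= trans.get((prev, t), 0.0) * lex.get((w, t), 0.0)
        prev = t
    return p * trans.get((prev, "."), 0.0)    # close off with the end marker

trans = {("^", "N"): 0.6, ("^", "V"): 0.2, ("N", "N"): 0.1, ("N", "V"): 0.4,
         ("V", "N"): 0.3, ("N", "."): 0.2, ("V", "."): 0.3}
lex = {("people", "N"): 1e-3, ("people", "V"): 1e-6,
       ("laugh", "N"): 1e-5, ("laugh", "V"): 1e-3}

print(score(["people", "laugh"], ["N", "V"], trans, lex))   # 7.2e-08
print(score(["people", "laugh"], ["V", "N"], trans, lex))   # 1.2e-13, so N V wins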

Generative Model

^_^ People_N Jump_V High_R ._.

[Figure: the tag sequence drawn as a chain of states from ^ to ., with bigram probabilities on the arcs between tags and lexical probabilities connecting each tag to the word it emits.]

This model is called a generative model: here words are observed from tags as states. This is similar to an HMM.

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi - Unigram, Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

[Bar chart: per-POS accuracy under the bigram assumption; y-axis from 0 to 1.2, x-axis the tags AJ0 through VVZ-NN2, including ambiguity tags such as AJ0-NN1, AJ0-VVG, AV0-AJ0, CJS-PRP, NN1-VVG and VVN-VVD.]

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

            AJ0  AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN  AJC  AJS   AT0   AV0  AV0-AJ0  AVP
AJ0        2899    20       32        1        3        3      0    0     18    35     27      1
AJ0-AV0      31    18        2        0        0        0      0    0      0     1     15      0
AJ0-NN1     161     0      116        0        0        0      0    0      0     0      1      0
AJ0-VVD       7     0        0        0        0        0      0    0      0     0      0      0
AJ0-VVG       8     0        0        0        2        0      0    0      1     0      0      0
AJ0-VVN       8     0        0        3        0        2      0    0      1     0      0      0
AJC           2     0        0        0        0        0     69    0      0    11      0      0
AJS           6     0        0        0        0        0      0   38      0     2      0      0
AT0         192     0        0        0        0        0      0    0   7000    13      0      0
AV0         120     8        2        0        0        0     15    2     24  2444     29     11
AV0-AJ0      10     7        0        0        0        0      0    0      0    16     33      0
AVP          24     0        0        0        0        0      0    0      1    11      0    737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1: # of Red = 30, # of Green = 50, # of Blue = 20
Urn 2: # of Red = 10, # of Green = 40, # of Blue = 50
Urn 3: # of Red = 60, # of Green = 10, # of Blue = 30

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

Given:

Transition probability table:
       U1   U2   U3
U1    0.1  0.4  0.5
U2    0.6  0.2  0.2
U3    0.3  0.4  0.3

Emission probability table:
       R    G    B
U1    0.3  0.5  0.2
U2    0.1  0.4  0.5
U3    0.6  0.1  0.3

Observation: RRGGBRGR
State Sequence: ?? (not so easily computable)

21 July 2014 Pushpak Bhattacharyya Intro
POS 73
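A tiny generation sketch (not from the slides): pick an urn, draw a ball (emission), move to the next urn (transition), repeat; this is exactly the process the tables above describe. The uniform choice of starting urn is an assumption.

import random

urns = ["U1", "U2", "U3"]
colors = ["R", "G", "B"]
trans = {"U1": [0.1, 0.4, 0.5], "U2": [0.6, 0.2, 0.2], "U3": [0.3, 0.4, 0.3]}
emit  = {"U1": [0.3, 0.5, 0.2], "U2": [0.1, 0.4, 0.5], "U3": [0.6, 0.1, 0.3]}

def generate(n):
    urn = random.choice(urns)                       # assumed uniform start
    obs, hidden = [], []
    for _ in range(n):
        hidden.append(urn)
        obs.append(random.choices(colors, weights=emit[urn])[0])
        urn = random.choices(urns, weights=trans[urn])[0]
    return obs, hidden

obs, hidden = generate(8)
print(obs)    # e.g. ['R', 'R', 'G', 'G', 'B', 'R', 'G', 'R']; the urn sequence stays hidden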

Diagrammatic representation (12)

[Figure: the three urns U1, U2, U3 drawn as states, with the transition probabilities above on the arcs (0.1, 0.4, 0.5; 0.6, 0.2, 0.2; 0.3, 0.4, 0.3) and the emission probabilities written beside each state: U1: R 0.3, G 0.5, B 0.2; U2: R 0.1, G 0.4, B 0.5; U3: R 0.6, G 0.1, B 0.3.]

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

[Figure: the same three-state diagram, with each arc now labelled by the product of transition and emission probabilities, e.g. R 0.02 / G 0.08 / B 0.10, R 0.24 / G 0.04 / B 0.12, R 0.06 / G 0.24 / B 0.30, R 0.08 / G 0.20 / B 0.12, R 0.15 / G 0.25 / B 0.10, R 0.18 / G 0.03 / B 0.09, R 0.03 / G 0.05 / B 0.02.]

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM:
1. Given the observation sequence, find the possible state sequences - Viterbi
2. Given the observation sequence, find its probability - forward/backward algorithm
3. Given the observation sequence, find the HMM parameters - Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi

The "start" and "end" are important in a sequence. Subtrees get eliminated due to the Markov Assumption.

POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), . (end) [start & end states]

Lexicon: people: N, V; laugh: N, V

Corpora for Training:
^ w11_t11 w12_t12 w13_t13 ............ w1k_1_t1k_1 .
^ w21_t21 w22_t22 w23_t23 ............ w2k_2_t2k_2 .
...
^ wn1_tn1 wn2_tn2 wn3_tn3 ............ wnk_n_tnk_n .

Inference: partial sequence graph. ^ expands to {N, V}, each of which expands again to {N, V}, and so on.

Transition Probability Table:
       ^    N    V    O    .
^      0   0.6  0.2  0.2   0
N      0   0.1  0.4  0.3  0.2
V      0   0.3  0.1  0.3  0.3
O      0   0.3  0.2  0.3  0.2
.      1    0    0    0    0

This transition table will change from language to language due to language divergences.

Lexical Probability Table:
       ε     people     laugh     ...
^      1       0           0       0
N      0     1x10^-3     1x10^-5
V      0     1x10^-6     1x10^-3
O      0       0         1x10^-9
.      1       0           0       0

Size of this table = #POS tags in the tagset x vocabulary size, where vocabulary size = #unique words in the corpus.
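A short sketch (not from the slides) of how these two tables would be estimated from the tagged training corpora above, by simple maximum-likelihood counting; the two-sentence corpus is a made-up stand-in.

from collections import defaultdict, Counter

corpus = [
    [("<eps>", "^"), ("people", "N"), ("laugh", "V"), (".", ".")],
    [("<eps>", "^"), ("people", "N"), ("dance", "V"), (".", ".")],
]

trans_counts = defaultdict(Counter)
lex_counts = defaultdict(Counter)
for sent in corpus:
    for (_, t), (_, t_next) in zip(sent, sent[1:]):
        trans_counts[t][t_next] += 1          # count adjacent tag pairs
    for w, t in sent:
        lex_counts[t][w] += 1                 # count (word, tag) pairs

def p_trans(t_prev, t):                       # P(t | t_prev)
    return trans_counts[t_prev][t] / sum(trans_counts[t_prev].values())

def p_lex(w, t):                              # P(w | t)
    return lex_counts[t][w] / sum(lex_counts[t].values())

print(p_trans("^", "N"), p_lex("people", "N"))   # both 1.0 on this toy corpus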

Inference: new sentence

^ people laugh .

p( ^ N N . | ^ people laugh . ) = (0.6 x 1) x (0.1 x 1x10^-3) x (0.2 x 1x10^-5)
(each bracket pairs a transition probability with the lexical probability of the word emitted)

[Partial sequence graph: ^ expands on ε to N and V; each of these expands on "people" to N and V; and so on.]

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Semantic Analysis Representation in terms of

Predicate calculusSemantic NetsFramesConceptual Dependencies and Scripts

John gave a book to Mary Give action Agent John Object Book Recipient

Mary Challenge ambiguity in semantic role labeling

(Eng) Visiting aunts can be a nuisance (Hin) aapko mujhe mithaai khilaanii padegii

(ambiguous in Marathi and Bengali too not in Dravidian languages)

21 July 2014Pushpak Bhattacharyya Intro

POS 29

Coreference challenge

Binding of co-referring nouns and pronouns The monkey ate the banana because it

was hungry The monkey ate the banana because it

was ripe and sweet The monkey ate the banana because it

was lunch time

21 July 2014Pushpak Bhattacharyya Intro

POS 30

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 31

Pragmatics Very hard problem Model user intention

Tourist (in a hurry checking out of the hotel motioning to the service boy) Boy go upstairs and see if my sandals are under the divan Do not be late I just have 15 minutes to catch the train

Boy (running upstairs and coming back panting) yes sir they are there

World knowledge WHY INDIA NEEDS A SECOND OCTOBER (ToI

21007)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

DiscourseProcessing of sequence of sentences Mother to John

John go to school It is open today Should you bunk Father will be very angry

Ambiguity of openbunk whatWhy will the father be angry

Complex chain of reasoning and application of world knowledge Ambiguity of father

father as parent or

father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected ndash today was the math test

He couldnrsquot control the class

Teacher shouldnrsquot have made him responsible

After all he is just a janitor

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (12)1 Teacher (angrily) did you miss the class yesterday

Student not much

2 A man coming back to his parked car sees the sticker Parking fine He goes and thanks the policeman for appreciating his parking skill

3 John I got a Jaguar car for my unemployed youngest sonJack Thats a great exchange

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (22)A teacher-student exchangeTeacher What do you think is the capital of Ethiopia

Student What do you think

Teacher (angrily) I do not think ltpausegt I know

Student I do not think I know ltno pausegt21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps Segmentation (Word Boundary Detection) Identify the word

Isolated Word Recognition Identify W given SS (speech signal)

^arg max ( | )

WW P W SS

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood called ldquophonological model ldquo intuitively more tractable

P(W) = prior probability called ldquolanguage modelrdquo

^arg m ax ( | )

arg m ax ( ) ( | )W

W

W P W SS

P W P SS W

W appears in the corpus( ) words in the corpus

P W

21 July 2014Pushpak Bhattacharyya Intro

POS 39

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound Text ambiguity whether vs weather right vs write bought vs bot

Text Sound ambiguity read (present tense) vs read (past tense) lead (verb) vs lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W)= P(t o m ae t o |Word is ldquotomatordquo) = Product of arc probabilities

t

s4

o m o

ae

t

aa

end

s1 s2 s3s5

s6 s7

10 10 10 1010

10

073

027

Word

Pronunciation Automaton

Tomato

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs Discrimnative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

bull Entity A and Entity BndashMatch(AB)

ndashTwo entities match iff their parts matchbull Match(Parts(A) Parts(B))

ndashTwo entities match iff their properties matchbull Match(Properties(A) Properties(B))

bull Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding James Allan Speech and NLP Jurafsky and Martin Foundations of Statistical NLP Manning and Schutze

Other References Statistical NLP Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied DisciplinesPhilosophy Semantics Meaning of ldquomeaningrdquo Logic

(syllogism)Linguistics Study of Syntax Lexicon Lexical Semantics etc

Probability and Statistics Corpus Linguistics Testing of Hypotheses System Evaluation

Cognitive Science Computational Models of Language Processing Language Acquisition

Psychology Behavioristic insights into Language Processing Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory Entropy Random Fields

Computer Sc amp Engg Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

infection declension derivation etc ) Rule based Vs Statistical NLU comparison with POS tagging as case study Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

ldquo_ldquo The_DT mechanisms_NNS that_WDTmake_VBP traditional_JJ hardware_NNare_VBP really_RB being_VBGobsoleted_VBN by_IN microprocessor-based_JJ machines_NNS _ rdquo_rdquo said_VBDMr_NNP Benton_NNP _

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70
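Both evaluation artifacts above (per-POS accuracy and the confusion matrix) come from the same counts: row totals of the matrix against its diagonal. A small sketch of how they are computed from gold vs. predicted tags (hypothetical toy tag lists; not code from the course):

from collections import Counter, defaultdict

def evaluate(gold_tags, pred_tags):
    confusion = defaultdict(Counter)                 # confusion[gold][predicted] = count
    for g, p in zip(gold_tags, pred_tags):
        confusion[g][p] += 1
    per_pos_accuracy = {g: row[g] / sum(row.values()) for g, row in confusion.items()}
    return per_pos_accuracy, confusion

gold = ['N', 'V', 'N', 'N', 'J']                     # hypothetical gold tags
pred = ['N', 'N', 'N', 'V', 'J']                     # hypothetical predicted tags
acc, conf = evaluate(gold, pred)
print(acc)          # {'N': 0.666..., 'V': 0.0, 'J': 1.0}
print(conf['N'])    # Counter({'N': 2, 'V': 1})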

HMM

(Figure: the NLP Trinity diagram repeated, with the Problem axis (Morph Analysis, POS Tagging, Parsing, Semantics), the Language axis (Hindi, Marathi, English, French) and the Algorithm axis (HMM, CRF, MEMM); HMM is the algorithm in focus here)

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1: # of Red = 30, # of Green = 50, # of Blue = 20
Urn 2: # of Red = 10, # of Green = 40, # of Blue = 50
Urn 3: # of Red = 60, # of Green = 10, # of Blue = 30

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

Given:

Transition probability table

      U1   U2   U3
U1   0.1  0.4  0.5
U2   0.6  0.2  0.2
U3   0.3  0.4  0.3

Emission probability table

      R    G    B
U1   0.3  0.5  0.2
U2   0.1  0.4  0.5
U3   0.6  0.1  0.3

Observation: R R G G B R G R

State Sequence: ?? Not so easily computable.

21 July 2014Pushpak Bhattacharyya Intro

POS 73


Diagrammatic representation (1/2)

(Figure: the three urns U1, U2, U3 drawn as states, with the transition probabilities 0.1, 0.4, 0.5, 0.6, 0.2, 0.2, 0.3, 0.4, 0.3 on the arcs and the emission probabilities attached to each state: U1: R 0.3, G 0.5, B 0.2; U2: R 0.1, G 0.4, B 0.5; U3: R 0.6, G 0.1, B 0.3)

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (2/2)

(Figure: the same three-state diagram, with each arc Si → Sj now labelled by the combined values P(colour|Si) x P(Sj|Si) for R, G and B, e.g. R: 0.02, G: 0.08, B: 0.10 on an arc leaving U2 with transition probability 0.2)

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM

1. Given the observation sequence, find the possible state sequences: Viterbi algorithm
2. Given the observation sequence, find its probability: Forward/Backward algorithm
3. Given the observation sequence, find the HMM parameters: Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi

The "start" and "end" are important in a sequence. Subtrees get eliminated due to the Markov Assumption.

POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), . (end) [start & end states]

Illustration of Viterbi: Lexicon

people: N, V
laugh: N, V

Corpora for Training:
^ w11_t11 w12_t12 w13_t13 … w1k_1_t1k_1
^ w21_t21 w22_t22 w23_t23 … w2k_2_t2k_2
…
^ wn1_tn1 wn2_tn2 wn3_tn3 … wnk_n_tnk_n

Inference

(Figure: partial sequence graph; ^ branches to N and V for 'people', each of which branches again to N and V for 'laugh')

Transition Probability Table (row = previous tag, column = current tag):

      ^    N    V    O    .
^     0   0.6  0.2  0.2   0
N     0   0.1  0.4  0.3  0.2
V     0   0.3  0.1  0.3  0.3
O     0   0.3  0.2  0.3  0.2
.     1    0    0    0    0

This transition table will change from language to language due to language divergences.

Lexical Probability Table (size of this table = #POS tags in tagset x vocabulary size; vocabulary size = #unique words in corpus):

      Є      people     laugh     …
^     1        0          0
N     0      1x10^-3    1x10^-5
V     0      1x10^-6    1x10^-3
O     0        0        1x10^-9
.     1        0          0

Inference: New Sentence

^ people laugh .

p(^ N N . | ^ people laugh .) = (0.6 x 0.1) x (0.1 x 1x10^-3) x (0.2 x 1x10^-5)

(Figure: the corresponding piece of the sequence graph, ^ going via the Є-transition to N and then to N)

Computational Complexity

If we have to get the probability of each sequence and then find the maximum among them, we would run into an exponential number of computations.

If |s| = #states (tags + ^ + .) and |o| = length of the sentence (#words + ^ + .), then #sequences = |s|^(|o|-2).

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming

(Figure: the sequence tree for '^ people laugh', with the root ^ expanding to nodes N1, V2, O3 for 'people', and N1 expanding to N4, V, O, … for 'laugh'. Scores along the branches through N1:)

^ to N1 (reading Є):  0.6 x 1.0 = 0.6
N1 to N:  0.6 x 0.1 x 10^-3 = 6 x 10^-5
N1 to V:  0.6 x 0.4 x 10^-3 = 2.4 x 10^-4
N1 to O:  0.6 x 0.3 x 10^-3 = 1.8 x 10^-4
N1 to .:  0.6 x 0.2 x 10^-3 = 1.2 x 10^-4

No need to expand N4 and N5, because they will never be part of the winning sequence.

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.

Now complexity reduces from |s|^|o| to |s| x |o|.

Here we followed the Markov assumption of order 1
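To make the exponential-versus-linear contrast concrete, here is a brute-force sketch (not from the slides) that enumerates and scores every tag sequence for '^ people laugh .' using the transition and lexical tables above; the dynamic-programming trellis avoids exactly this enumeration:

from itertools import product

TAGS = ['N', 'V', 'O']
# P(t_i | t_i-1) and P(w_i | t_i), copied from the tables above
trans = {'^': {'N': 0.6, 'V': 0.2, 'O': 0.2, '.': 0.0},
         'N': {'N': 0.1, 'V': 0.4, 'O': 0.3, '.': 0.2},
         'V': {'N': 0.3, 'V': 0.1, 'O': 0.3, '.': 0.3},
         'O': {'N': 0.3, 'V': 0.2, 'O': 0.3, '.': 0.2}}
lex = {'N': {'people': 1e-3, 'laugh': 1e-5},
       'V': {'people': 1e-6, 'laugh': 1e-3},
       'O': {'people': 0.0,  'laugh': 1e-9}}

def score(words, tags):
    p, prev = 1.0, '^'
    for w, t in zip(words, tags):
        p *= trans[prev][t] * lex[t].get(w, 0.0)
        prev = t
    return p * trans[tags[-1]]['.']        # close the sentence with the end marker

words = ['people', 'laugh']
scores = {tags: score(words, tags) for tags in product(TAGS, repeat=len(words))}
best = max(scores, key=scores.get)
print(len(scores), best, scores[best])     # |s|^(|o|-2) = 3^2 = 9 sequences enumerated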

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state. Keep advancing sequences that are "maximum" amongst all those ending in the same state.

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm

(Figure: tree for the sentence "^ People laugh .": the root ^ expands on Ԑ to N (0.6), V (0.2), O (0.2); each of these expands again to N, V, O on reading 'People', with scores 0.06x10^-3, 0.24x10^-3, 0.18x10^-3 under N, scores 0.06x10^-6, 0.02x10^-6, 0.06x10^-6 under V, and 0, 0, 0 under O)

Claim: We do not need to draw all the subtrees in the algorithm.

POS 87

Effect of shifting probability mass

Will a word always be given the same tag? No. Consider the examples:

^ people the city with soldiers (i.e. 'populate')
^ quickly people the city

In the first sentence "people" is most likely to be tagged as noun, whereas in the second the probability mass will shift and "people" will be tagged as verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon

Long tail Phenomenon: the probability is very low but not zero over a large observed sequence.

Language Phenomenon: "people", which is predominantly tagged as "Noun", displays a long tail phenomenon; "laugh" is predominantly tagged as "Verb".

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

(Figure: two nodes, N1 with score 6x10^-5 and N2 with score 6x10^-8, each expanding to N, V, O on reading LAUGH)

Next step: all the probabilities will be multiplied by identical probabilities (lexical and transition). So the children of N2 will have probability less than the children of N1.

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B) = P(B|A), if P(A) = P(B)

P(A|B) can mean:
  Causality: B causes A
  Sequentiality: A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = {U1, U2, U3}, V = {R, G, B}

For observation O = o1 … on and state sequence Q = q1 … qn:

π_i = P(q1 = Ui)

A = transition matrix:
      U1   U2   U3
U1   0.1  0.4  0.5
U2   0.6  0.2  0.2
U3   0.3  0.4  0.3

B = emission matrix:
      R    G    B
U1   0.3  0.5  0.2
U2   0.1  0.4  0.5
U3   0.6  0.1  0.3

92
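For experimenting with the algorithms that follow, the urn HMM can be written out directly as data. This is a sketch: A and B are copied from the tables above, while π is not given numerically on the slide, so a uniform start is assumed here purely for illustration.

states  = ['U1', 'U2', 'U3']
symbols = ['R', 'G', 'B']
A = {'U1': {'U1': 0.1, 'U2': 0.4, 'U3': 0.5},      # P(q_t+1 | q_t)
     'U2': {'U1': 0.6, 'U2': 0.2, 'U3': 0.2},
     'U3': {'U1': 0.3, 'U2': 0.4, 'U3': 0.3}}
B = {'U1': {'R': 0.3, 'G': 0.5, 'B': 0.2},         # P(o_t | q_t)
     'U2': {'R': 0.1, 'G': 0.4, 'B': 0.5},
     'U3': {'R': 0.6, 'G': 0.1, 'B': 0.3}}
pi = {'U1': 1/3, 'U2': 1/3, 'U3': 1/3}             # assumed uniform; not specified on the slide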

Observations and states

       O1 O2 O3 O4 O5 O6 O7 O8
OBS:    R  R  G  G  B  R  G  R
State: S1 S2 S3 S4 S5 S6 S7 S8

Si = U1, U2 or U3; a particular state
S: state sequence
O: observation sequence
S* = "best" possible state (urn) sequence
Goal: maximize P(S|O) by choosing the "best" S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

S* = argmax_S P(S|O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

P(S|O) = P(S1-8 | O1-8)
       = P(S1|O) · P(S2|S1, O) · P(S3|S1-2, O) · … · P(S8|S1-7, O)

By Markov Assumption (a state depends only on the previous state):

P(S|O) = P(S1|O) · P(S2|S1, O) · P(S3|S2, O) · … · P(S8|S7, O)

       O1 O2 O3 O4 O5 O6 O7 O8
OBS:    R  R  G  G  B  R  G  R
State: S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) = P(A) · P(B|A) / P(B)

P(A): Prior
P(B|A): Likelihood

argmax_S P(S|O) = argmax_S P(S) · P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

P(S) = P(S1-8)
     = P(S1) · P(S2|S1) · P(S3|S1-2) · P(S4|S1-3) · … · P(S8|S1-7)

By Markov Assumption (k=1):

P(S) = P(S1) · P(S2|S1) · P(S3|S2) · P(S4|S3) · … · P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

P(O|S) = P(O1|S1-8) · P(O2|O1, S1-8) · P(O3|O1-2, S1-8) · … · P(O8|O1-7, S1-8)

Assumption: the ball drawn depends only on the Urn chosen.

P(O|S) = P(O1|S1) · P(O2|S2) · P(O3|S3) · … · P(O8|S8)

Putting the two together:

P(S|O) ∝ P(S) · P(O|S)
       = P(S1) · P(S2|S1) · P(S3|S2) · … · P(S8|S7) · P(O1|S1) · P(O2|S2) · … · P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)·P(O|S) = [P(O0|S0) P(S1|S0)]
  [P(O1|S1) P(S2|S1)] [P(O2|S2) P(S3|S2)]
  [P(O3|S3) P(S4|S3)] [P(O4|S4) P(S5|S4)]
  [P(O5|S5) P(S6|S5)] [P(O6|S6) P(S7|S6)]
  [P(O7|S7) P(S8|S7)] [P(O8|S8) P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.

After S8, the next state is S9 with probability 1, i.e. P(S9|S8) = 1.
O0 is the ε-transition.

       O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:    ε  R  R  G  G  B  R  G  R
State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

(Figure: the chain S0 → S1 → S2 → … → S8 → S9, with the observations ε, R, R, G, G, B, R, G, R written on the arcs)

       O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:    ε  R  R  G  G  B  R  G  R
State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

P(Ok|Sk) · P(Sk+1|Sk) = P(Sk → Sk+1) with Ok written on the arc, i.e. P(Sk → Sk+1)^Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(Figure: a two-state probabilistic FSM over S1 and S2, each arc labelled with a symbol and its probability. Reading the labels off as used in the tree development on the next slides:
  S1 -a1-> S1: 0.1    S1 -a1-> S2: 0.3
  S1 -a2-> S1: 0.2    S1 -a2-> S2: 0.4
  S2 -a1-> S1: 0.2    S2 -a1-> S2: 0.3
  S2 -a2-> S1: 0.3    S2 -a2-> S2: 0.2)

The question here is: "What is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree

Start (on €): S1 with probability 1.0, S2 with probability 0.0.

Reading a1 (branch probabilities 0.1, 0.3 from S1 and 0.2, 0.3 from S2):
  1.0 x 0.1 = 0.1    1.0 x 0.3 = 0.3    (from S2: 0.0, 0.0)

Reading a2 (branch probabilities 0.2, 0.4 from S1 and 0.3, 0.2 from S2):
  0.1 x 0.2 = 0.02    0.1 x 0.4 = 0.04    0.3 x 0.3 = 0.09    0.3 x 0.2 = 0.06

Choose the winning sequence per state, per iteration.

POS 102

Tree structure contd…

Continuing from the per-state winners S1: 0.09 and S2: 0.06:

Reading a1 (branch probabilities 0.1, 0.3 from S1 and 0.2, 0.3 from S2):
  0.09 x 0.1 = 0.009    0.09 x 0.3 = 0.027    0.06 x 0.2 = 0.012    0.06 x 0.3 = 0.018

Reading a2 (from the winners S2: 0.027 and S1: 0.012):
  0.027 x 0.3 = 0.0081 (to S1)    0.027 x 0.2 = 0.0054 (to S2)
  0.012 x 0.2 = 0.0024 (to S1)    0.012 x 0.4 = 0.0048 (to S2)

The problem being addressed by this tree is S* = argmax_S P(S | a1-a2-a1-a2, μ),

where a1-a2-a1-a2 is the output sequence and μ the model or the machine.

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 → S2 → S1 → S2 → S1, reading a1, a2, a1, a2

Problem statement: find the best possible sequence

S* = argmax_S P(S|O, μ)

where S = state sequence, O = output sequence, μ = machine (or model)

Machine (or model) μ = (S, A, T, S0):
  S: state collection, A: alphabet set, T: transitions, S0: start symbol/state

T is defined as P(Si -a_k-> Sj)

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

Rows: ending state; columns: latest symbol observed.

        €      a1                              a2              a1                a2
S1     1.0    (1.0x0.1, 0.0x0.2) = (0.1, 0.0)  (0.02, 0.09)    (0.009, 0.012)    (0.0024, 0.0081)
S2     0.0    (1.0x0.3, 0.0x0.3) = (0.3, 0.0)  (0.04, 0.06)    (0.027, 0.018)    (0.0048, 0.0054)

Note: every cell records the winning probability ending in that state.

The bold-faced values in each cell show the sequence probability ending in that state. Going backward from the final winner, which is reached via state S2 (indicated by the 2nd position in the tuple), we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm (following James Allen, Natural Language Understanding, 2nd edition, Benjamin/Cummings (pub.), 1995)

Given:
1. The HMM, which means:
   a. Start State: S1
   b. Alphabet: A = {a1, a2, …, ap}
   c. Set of States: S = {S1, S2, …, Sn}
   d. Transition probability P(Si -a_k-> Sj), which is equal to P(Sj, ak | Si)
2. The output string: a1 a2 … aT

To find: the most likely sequence of states C1 C2 … CT which produces the given output sequence, i.e.,

C1 C2 … CT = argmax_C [ P(C | a1, a2, …, aT) ]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contd…

Data Structure:
1. An N*T array called SEQSCORE to maintain the winner sequence always (N = #states, T = length of o/p sequence)
2. Another N*T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For (i = 2 to N) do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For (t = 2 to T) do
    For (i = 1 to N) do
        SEQSCORE(i,t) = Max (j = 1..N) [ SEQSCORE(j, (t-1)) * P(Sj -a_k-> Si) ]
        BACKPTR(i,t) = index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = i that maximizes SEQSCORE(i,T)
For i from (T-1) to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1xT
2. SEQSCORE can be Tx2

Homework: compare this with A* and Beam Search. Reason for this comparison: both of them work for finding and recovering a sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 109
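A compact implementation sketch of the three steps above (initialization, iteration with SEQSCORE/BACKPTR, sequence identification). This is not the course's code; as a usage check it is run on the two-state probabilistic FSM from the earlier slides and recovers the same path S1-S2-S1-S2-S1.

def viterbi(states, start, arc_prob, output):
    # SEQSCORE[i][t]: best score of any path ending in states[i] after consuming output[:t]
    # BACKPTR[i][t]:  index of the predecessor state on that best path
    N, T = len(states), len(output)
    SEQSCORE = [[0.0] * (T + 1) for _ in range(N)]
    BACKPTR  = [[0]   * (T + 1) for _ in range(N)]
    SEQSCORE[states.index(start)][0] = 1.0                       # initialization
    for t in range(1, T + 1):                                    # iteration
        a = output[t - 1]
        for i, si in enumerate(states):
            best_j = max(range(N),
                         key=lambda j: SEQSCORE[j][t - 1] * arc_prob.get((states[j], a, si), 0.0))
            SEQSCORE[i][t] = SEQSCORE[best_j][t - 1] * arc_prob.get((states[best_j], a, si), 0.0)
            BACKPTR[i][t] = best_j
    last = max(range(N), key=lambda i: SEQSCORE[i][T])           # sequence identification
    path = [last]
    for t in range(T, 0, -1):
        path.append(BACKPTR[path[-1]][t])
    return [states[i] for i in reversed(path)]

# The FSM from the earlier slides, as (source state, symbol, destination state) -> probability
arc = {('S1', 'a1', 'S1'): 0.1, ('S1', 'a1', 'S2'): 0.3,
       ('S1', 'a2', 'S1'): 0.2, ('S1', 'a2', 'S2'): 0.4,
       ('S2', 'a1', 'S1'): 0.2, ('S2', 'a1', 'S2'): 0.3,
       ('S2', 'a2', 'S1'): 0.3, ('S2', 'a2', 'S2'): 0.2}
print(viterbi(['S1', 'S2'], 'S1', arc, ['a1', 'a2', 'a1', 'a2']))
# ['S1', 'S2', 'S1', 'S2', 'S1'], the path found on the earlier slide (score 0.0081)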

Viterbi Algorithm for the Urn problem (first two symbols)

(Figure: the first two levels of the Viterbi tree for the urn problem. S0 expands on ε to U1, U2, U3 with arc probabilities 0.5, 0.3, 0.2; each of these expands on R to U1, U2, U3 again. The intermediate arc/node values shown include 0.03, 0.08, 0.15, 0.06, 0.02, 0.18, 0.24, and the leaf scores shown are 0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036.)

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of order > 1 (say 2)

The same theory works:

P(S)·P(O|S) = P(O0|S0) P(S1|S0)
  [P(O1|S1) P(S2|S1,S0)] [P(O2|S2) P(S3|S2,S1)]
  [P(O3|S3) P(S4|S3,S2)] [P(O4|S4) P(S5|S4,S3)]
  [P(O5|S5) P(S6|S5,S4)] [P(O6|S6) P(S7|S6,S5)]
  [P(O7|S7) P(S8|S7,S6)] [P(O8|S8) P(S9|S8,S7)]

We introduce the states S0 and S9 as initial and final states respectively.

After S8, the next state is S9 with probability 1, i.e. P(S9|S8,S7) = 1.
O0 is the ε-transition.

       O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:    ε  R  R  G  G  B  R  G  R
State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence? The language modeling problem.

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   • Less probable because of grammatical mistake
3. P("The svn rises in the east")
   • Less probable because of lexical mistake
4. P("The sun rises in the west")
   • Less probable because of semantic mistake

Probabilities computed in the context of corpora.

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model

1. Detect well-formedness
   • Lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   • Given a piece of text, what language does it belong to?
     Good morning -> English
     Guten Morgen -> German
     Bon jour -> French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114
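A toy bigram language model makes the "computed in the context of corpora" point concrete. The miniature corpus below is hypothetical (a sketch, not from the slides); real language models use large corpora and smoothing for unseen n-grams.

from collections import Counter

corpus = "the sun rises in the east . the sun sets in the west .".split()   # hypothetical corpus
unigrams = Counter(corpus)
bigrams  = Counter(zip(corpus, corpus[1:]))

def p_sentence(words):
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        if unigrams[w1] == 0 or bigrams[(w1, w2)] == 0:
            return 0.0                       # unseen n-gram: smoothing would be needed
        p *= bigrams[(w1, w2)] / unigrams[w1]   # MLE bigram probability
    return p

print(p_sentence("the sun rises in the east".split()))   # non-zero
print(p_sentence("the sun rise in the east".split()))    # 0.0: the bigram (sun, rise) is unseen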

How to compute P(o0, o1, o2, o3, …, om)?

P(O) = Σ_S P(O, S)     (Marginalization)

Consider the observation sequence O = o0 o1 o2 … om, generated along state sequences S = S0 S1 S2 S3 … Sm Sm+1,

where the Si's represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0, o1, o2, o3, …, om)

P(O, S) = P(S) · P(O|S)

P(S0 S1 … Sm+1) · P(o0 o1 o2 … om | S0 S1 … Sm+1)
  = [P(S0) P(S1|S0) P(S2|S1) … P(Sm+1|Sm)] · [P(o0|S0) P(o1|S1) … P(om|Sm)]
  = P(S0) [P(o0|S0) P(S1|S0)] [P(o1|S1) P(S2|S1)] … [P(om|Sm) P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k,i)

Define F(k,i) = probability of being in state Si having seen o0 o1 o2 … ok

F(k,i) = P(o0, o1, o2, …, ok, Si)

With m as the length of the observed sequence and N states:

P(observed sequence) = P(o0, o1, o2, …, om)
                     = Σ (p = 0 to N) P(o0, o1, o2, …, om, Sp)
                     = Σ (p = 0 to N) F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)

F(k, q) = P(o0, o1, …, ok, Sq)
        = P(o0, o1, …, ok-1, ok, Sq)
        = Σ (p = 0 to N) P(o0, o1, …, ok-1, Sp, ok, Sq)
        = Σ (p = 0 to N) P(o0, o1, …, ok-1, Sp) · P(ok, Sq | o0, o1, …, ok-1, Sp)
        = Σ (p = 0 to N) F(k-1, p) · P(ok, Sq | Sp)
        = Σ (p = 0 to N) F(k-1, p) · P(Sp → Sq)^ok

O0 O1 O2 O3 … Ok Ok+1 … Om-1 Om
S0 S1 S2 S3 … Sp  Sq  … Sm  Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119
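A sketch of this recurrence in code (not from the slides). F[k][q] accumulates the probability of the first k observations with the arc convention used in the worked '^ People Laugh' example a few slides below (each state emits its observation on the outgoing arc); the tables from that example are reused here as the usage check.

def forward(states, trans, emit, start, obs):
    """F[k][q] = probability of having emitted obs[:k] and being in state q."""
    F = [{q: (1.0 if q == start else 0.0) for q in states}]
    for o in obs:                                   # o is emitted on the arcs leaving column k
        prev = F[-1]
        F.append({q: sum(prev[p] * emit[p].get(o, 0.0) * trans[p].get(q, 0.0)
                         for p in states)
                  for q in states})
    return F

# Tables from the '^ People Laugh' illustration below
trans = {'^': {'N': 0.7, 'V': 0.3}, 'N': {'N': 0.2, 'V': 0.6, '.': 0.2},
         'V': {'N': 0.6, 'V': 0.2, '.': 0.2}, '.': {}}
emit  = {'^': {'ε': 1.0}, 'N': {'People': 0.8, 'Laugh': 0.2},
         'V': {'People': 0.1, 'Laugh': 0.9}, '.': {}}
F = forward(['^', 'N', 'V', '.'], trans, emit, '^', ['ε', 'People', 'Laugh'])
print(F[-1]['.'])   # 0.06676: P(O) for paths reaching the end state '.', the sum of the four path probabilities below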

Backward probability B(k,i)

Define B(k,i) = probability of seeing ok, ok+1, ok+2, …, om given that the state was Si

B(k,i) = P(ok, ok+1, ok+2, …, om | Si)

With m as the length of the whole observed sequence:

P(observed sequence) = P(o0, o1, o2, …, om) = P(o0, o1, o2, …, om | S0) = B(0,0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)

B(k, p) = P(ok, ok+1, ok+2, …, om | Sp)
        = P(ok+1, ok+2, …, om, ok | Sp)
        = Σ (q = 0 to N) P(ok+1, ok+2, …, om, ok, Sq | Sp)
        = Σ (q = 0 to N) P(ok, Sq | Sp) · P(ok+1, ok+2, …, om | ok, Sq, Sp)
        = Σ (q = 0 to N) P(ok+1, ok+2, …, om | Sq) · P(ok, Sq | Sp)
        = Σ (q = 0 to N) B(k+1, q) · P(Sp → Sq)^ok

O0 O1 O2 O3 … Ok Ok+1 … Om-1 Om
S0 S1 S2 S3 … Sp  Sq  … Sm  Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability: to find P(O) [the probability of the Observation Sequence]

E.g.: ^ People laugh

(Figure: trellis with the start state ^ followed by two columns of states N and V, for 'People' and 'Laugh')

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probability table P(t_i | t_i-1):

      ^    N    V    .
^     0   0.7  0.3   0
N     0   0.2  0.6  0.2
V     0   0.6  0.2  0.2
.     1    0    0    0

Lexical probability table P(w | t):

      ε    People  Laugh
^     1      0       0
N     0     0.8     0.2
V     0     0.1     0.9
.     1      0       0

Inefficient Computation:

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all state sequences of Π P(Si → Sj)^Oi, i.e. the sum of the probabilities of all the paths in the tree below.

Computation in the various paths of the Tree (reading ε, People, Laugh)

Path 1: ^ N N
P(Path1) = (1.0 x 0.7) x (0.8 x 0.2) x (0.2 x 0.2)

Path 2: ^ N V
P(Path2) = (1.0 x 0.7) x (0.8 x 0.6) x (0.9 x 0.2)

Path 3: ^ V N
P(Path3) = (1.0 x 0.3) x (0.1 x 0.6) x (0.2 x 0.2)

Path 4: ^ V V
P(Path4) = (1.0 x 0.3) x (0.1 x 0.2) x (0.9 x 0.2)

(Figure: the tree rooted at ^, branching to N and V on ε/People, each branching again to N and V on Laugh)

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis

F = accumulated F x output probability x transition probability

F1 (at N, 'People' column) = 1.0 x 0.7
F2 (at V, 'People' column) = 1.0 x 0.3
F3 (at N, 'Laugh' column)  = F1 x (0.8 x 0.2) + F2 x (0.1 x 0.6)
F4 (at V, 'Laugh' column)  = F1 x (0.8 x 0.6) + F2 x (0.1 x 0.2)
F5 (at the end state)      = F3 x (0.2 x 0.2) + F4 x (0.9 x 0.2)

(Figure: the trellis ^ → {N, V} → {N, V} → end over ε, People, Laugh)

21 July 2014Pushpak Bhattacharyya Intro

POS 125
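A quick numeric check of the trellis values above (a sketch, not from the slides); the last value equals the sum of the four path probabilities computed on the previous slide.

F1 = 1.0 * 1.0 * 0.7                         # P(ε|^) x P(^→N)
F2 = 1.0 * 1.0 * 0.3                         # P(ε|^) x P(^→V)
F3 = F1 * (0.8 * 0.2) + F2 * (0.1 * 0.6)     # 'People' emitted by N / V, then → N
F4 = F1 * (0.8 * 0.6) + F2 * (0.1 * 0.2)     # 'People' emitted by N / V, then → V
F5 = F3 * (0.2 * 0.2) + F4 * (0.9 * 0.2)     # 'Laugh' emitted by N / V, then → end
print(F1, F2, F3, F4, F5)                    # 0.7 0.3 0.13 0.342 0.06676
# F5 equals P(Path1) + P(Path2) + P(Path3) + P(Path4) from the previous slide.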

Number of Multiplications

Tree: each path has 5 multiplications + 1 addition. There are 4 paths in the tree. Therefore a total of 20 multiplications and 3 additions.

Trellis: F1 -> 1 multiplication; F2 -> 1 multiplication; F3 = F1 x (1 mult) + F2 x (1 mult) = 4 multiplications + 1 addition. Similarly for F4 and F5: 4 multiplications and 1 addition each.

So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = #states and |O| = observation length (e.g. |^ People Laugh .|).

Stage 1 of the Trellis: |S| multiplications.
Stage 2 of the Trellis: |S| nodes; each node needs computation over |S| arcs. Each arc = 1 multiplication; accumulated F = 1 more multiplication. Total: 2|S|^2 multiplications.
The same holds for each stage before reading '.'. At the final stage ('.'): 2|S| multiplications.

Therefore total multiplications = |S| + 2|S|^2 (|O| - 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values multiplied by P(Si → Sj).
3. Complexity = |S| + 2|S|^2 (|O| - 1) + 2|S|
              = 2|S|^2 |O| - 2|S|^2 + 3|S|
              = O(|S|^2 |O|)

i.e. linear in the length of the input and quadratic in the number of states.

POS 128

Exercise

1. Backward Probability
   a) Derive the Backward Algorithm
   b) Compute its complexity
2. Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble: auto-completion of words (human vs machine)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List

TnT (http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf)

Brill Tagger (http://delivery.acm.org/10.1145/1080000/1075553/p112-brill.pdf)

Hindi POS Tagger built by IIT Bombay (http://www.cse.iitb.ac.in/~pb/papers/ACL-2006-Hindi-POS-Tagging.pdf)

Projection (http://www.dipanjandas.com/files/posInduction.pdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131


POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 31

Pragmatics Very hard problem Model user intention

Tourist (in a hurry checking out of the hotel motioning to the service boy) Boy go upstairs and see if my sandals are under the divan Do not be late I just have 15 minutes to catch the train

Boy (running upstairs and coming back panting) yes sir they are there

World knowledge WHY INDIA NEEDS A SECOND OCTOBER (ToI

21007)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

DiscourseProcessing of sequence of sentences Mother to John

John go to school It is open today Should you bunk Father will be very angry

Ambiguity of openbunk whatWhy will the father be angry

Complex chain of reasoning and application of world knowledge Ambiguity of father

father as parent or

father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected ndash today was the math test

He couldnrsquot control the class

Teacher shouldnrsquot have made him responsible

After all he is just a janitor

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (12)1 Teacher (angrily) did you miss the class yesterday

Student not much

2 A man coming back to his parked car sees the sticker Parking fine He goes and thanks the policeman for appreciating his parking skill

3 John I got a Jaguar car for my unemployed youngest sonJack Thats a great exchange

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (22)A teacher-student exchangeTeacher What do you think is the capital of Ethiopia

Student What do you think

Teacher (angrily) I do not think ltpausegt I know

Student I do not think I know ltno pausegt21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps Segmentation (Word Boundary Detection) Identify the word

Isolated Word Recognition Identify W given SS (speech signal)

^arg max ( | )

WW P W SS

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood called ldquophonological model ldquo intuitively more tractable

P(W) = prior probability called ldquolanguage modelrdquo

^arg m ax ( | )

arg m ax ( ) ( | )W

W

W P W SS

P W P SS W

W appears in the corpus( ) words in the corpus

P W

21 July 2014Pushpak Bhattacharyya Intro

POS 39

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound Text ambiguity whether vs weather right vs write bought vs bot

Text Sound ambiguity read (present tense) vs read (past tense) lead (verb) vs lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W)= P(t o m ae t o |Word is ldquotomatordquo) = Product of arc probabilities

t

s4

o m o

ae

t

aa

end

s1 s2 s3s5

s6 s7

10 10 10 1010

10

073

027

Word

Pronunciation Automaton

Tomato

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs Discrimnative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

bull Entity A and Entity BndashMatch(AB)

ndashTwo entities match iff their parts matchbull Match(Parts(A) Parts(B))

ndashTwo entities match iff their properties matchbull Match(Properties(A) Properties(B))

bull Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding James Allan Speech and NLP Jurafsky and Martin Foundations of Statistical NLP Manning and Schutze

Other References Statistical NLP Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied DisciplinesPhilosophy Semantics Meaning of ldquomeaningrdquo Logic

(syllogism)Linguistics Study of Syntax Lexicon Lexical Semantics etc

Probability and Statistics Corpus Linguistics Testing of Hypotheses System Evaluation

Cognitive Science Computational Models of Language Processing Language Acquisition

Psychology Behavioristic insights into Language Processing Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory Entropy Random Fields

Computer Sc amp Engg Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

infection declension derivation etc ) Rule based Vs Statistical NLU comparison with POS tagging as case study Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

"_" The_DT mechanisms_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, "_" said_VBD Mr._NNP Benton_NNP ._.

21 July 2014Pushpak Bhattacharyya Intro

POS 57
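The Penn Treebank style tags above can be reproduced directly with NLTK, the toolkit used in the lab sessions. A minimal sketch; it assumes NLTK and its pretrained tagger model are installed (resource names can differ slightly across NLTK versions), and the tags returned may not match the hand-tagged example exactly.

```python
# Minimal NLTK sketch: tag a sentence with Penn Treebank tags.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)   # name may vary by NLTK version

sentence = ("The mechanisms that make traditional hardware are really being "
            "obsoleted by microprocessor-based machines, said Mr. Benton.")

tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))   # [('The', 'DT'), ('mechanisms', 'NNS'), ...]
```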

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_J batsman_N Aravinda_De_Silva_N is_F a_F man_N of_F few_J words_N was_F very_R much_R evident_J on_F Wednesday_N when_F the_F legendary_J batsman_N ,_F who_F has_V always_R let_V his_N bat_N talk_V ,_F struggled_V to_F answer_V a_F barrage_N of_F questions_N at_F a_F function_N to_F promote_V the_F cricket_N league_N in_F the_F city_N ._F

N (noun), V (verb), J (adjective), R (adverb) and F (other, i.e. function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation
  That_F/N/J ('that' can be a complementizer (can be put under 'F'), demonstrative (can be put under 'J') or pronoun (can be put under 'N'))
  former_J Sri_N/J Lanka_N/J ('Sri Lanka' together qualify the skipper)
  skipper_N/V ('skipper' can be a verb too)
  and_F ace_J/N ('ace' can be both J and N: "Nadal served an ace")
  batsman_N/J ('batsman' can be J as it qualifies Aravinda De Silva)
  Aravinda_N De_N Silva_N is_F a_F man_N/V ('man' can be a verb too, as in 'man the boat')
  of_F few_J words_N/V ('words' can be a verb too, as in 'he words his speeches beautifully')

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of "That"

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation
  was_F very_R much_R evident_J on_F Wednesday_N
  when_F/N ('when' can be a relative pronoun (put under 'N') as in 'I know the time when he comes')
  the_F legendary_J batsman_N who_F/N has_V always_R let_V his_N bat_N/V talk_V/N struggle_V/N answer_V/N barrage_N/V question_N/V function_N/V promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (1/2)

Best tag sequence = T* = argmax P(T|W) = argmax P(T) P(W|T)   (by Bayes' Theorem)

P(T) = P(t0=^ t1 t2 ... tn+1=.)
     = P(t0) P(t1|t0) P(t2|t1,t0) P(t3|t2,t1,t0) ... P(tn|tn-1,tn-2,...,t0) P(tn+1|tn,tn-1,...,t0)
     = P(t0) P(t1|t0) P(t2|t1) ... P(tn|tn-1) P(tn+1|tn)
     = ∏ (i = 0 to n+1) P(ti|ti-1)   (Bigram Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0..tn+1) P(w1|w0,t0..tn+1) P(w2|w1,w0,t0..tn+1) ... P(wn|w0..wn-1,t0..tn+1) P(wn+1|w0..wn,t0..tn+1)

Assumption: A word is determined completely by its tag. This is inspired by speech recognition.

       = P(w0|t0) P(w1|t1) ... P(wn+1|tn+1)
       = ∏ (i = 0 to n+1) P(wi|ti)   (Lexical Probability Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 66
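Under the two assumptions above, the score of a candidate tag sequence is just a product of transition (bigram) and lexical probabilities. A small illustrative sketch; the probability tables below are toy placeholders rather than values estimated from any corpus.

```python
# Score a tagged sentence as P(T) P(W|T) = prod_i P(t_i | t_{i-1}) P(w_i | t_i),
# with ^ and . as the boundary tags. Toy tables for illustration only.
P_trans = {("^", "N"): 0.6, ("N", "V"): 0.4, ("N", "N"): 0.1, ("V", "."): 0.3, ("N", "."): 0.2}
P_lex   = {("people", "N"): 1e-3, ("laugh", "V"): 1e-3, ("laugh", "N"): 1e-5}

def score(words, tags):
    p = P_trans.get(("^", tags[0]), 0.0)            # transition out of the start tag
    for i, (w, t) in enumerate(zip(words, tags)):
        p *= P_lex.get((w, t), 0.0)                 # lexical probability
        if i + 1 < len(tags):
            p *= P_trans.get((t, tags[i + 1]), 0.0) # bigram transition
    return p * P_trans.get((tags[-1], "."), 0.0)    # transition into the end tag

print(score(["people", "laugh"], ["N", "V"]))   # 0.6 * 1e-3 * 0.4 * 1e-3 * 0.3
print(score(["people", "laugh"], ["N", "N"]))   # 0.6 * 1e-3 * 0.1 * 1e-5 * 0.2
```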

Generative Model

^_^ People_N Jump_V High_R ._.

(Figure: tag states ^, N, V, A, . linked by bigram (transition) probabilities, with the words emitted from the tags via lexical probabilities.)

This model is called the Generative model. Here words are observed (generated) from tags treated as states. This is similar to HMM.

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi - Unigram, Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68
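A sketch of the last two evaluation steps listed above (per-POS accuracy and confusion matrix). The gold and predicted tag lists here are placeholders standing in for the output of the tagging assignment.

```python
from collections import Counter, defaultdict

# Gold and predicted tags, flattened over one test fold (placeholder data).
gold = ["AJ0", "AT0", "NN1", "AV0", "AJ0", "NN1"]
pred = ["AJ0", "AT0", "AJ0", "AV0", "AJ0", "NN1"]

confusion = defaultdict(Counter)          # confusion[gold_tag][predicted_tag] -> count
for g, p in zip(gold, pred):
    confusion[g][p] += 1

for tag, row in sorted(confusion.items()):
    total = sum(row.values())             # per-POS accuracy = diagonal / row total
    print(tag, "per-POS accuracy:", row[tag] / total, "row:", dict(row))
```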

Per POS Accuracy for Bigram Assumption

(Figure: bar chart of per-tag accuracy on a 0 to 1.2 scale over the BNC/CLAWS tags and ambiguity classes AJ0, AJ0-NN1, AJ0-VVG, AJC, AT0, AV0-AJ0, AVP-PRP, AVQ-CJS, CJS, CJS-PRP, CJT-DT0, CRD-PNI, DT0, DTQ, ITJ, NN1, NN1-NP0, NN1-VVG, NN2-VVZ, NP0-NN1, PNI, PNP, PNX, PRP, PRP-CJS, TO0, VBB, VBG, VBN, VDB, VDG, VDN, VHB, VHG, VHN, VM0, VVB-NN1, VVD-AJ0, VVG, VVG-NN1, VVN, VVN-VVD, VVZ-NN2.)

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

          AJ0  AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN  AJC  AJS   AT0   AV0  AV0-AJ0  AVP
AJ0      2899       20       32        1        3        3    0    0    18    35       27    1
AJ0-AV0    31       18        2        0        0        0    0    0     0     1       15    0
AJ0-NN1   161        0      116        0        0        0    0    0     0     0        1    0
AJ0-VVD     7        0        0        0        0        0    0    0     0     0        0    0
AJ0-VVG     8        0        0        0        2        0    0    0     1     0        0    0
AJ0-VVN     8        0        0        3        0        2    0    0     1     0        0    0
AJC         2        0        0        0        0        0   69    0     0    11        0    0
AJS         6        0        0        0        0        0    0   38     0     2        0    0
AT0       192        0        0        0        0        0    0    0  7000    13        0    0
AV0       120        8        2        0        0        0   15    2    24  2444       29   11
AV0-AJ0    10        7        0        0        0        0    0    0     0    16       33    0
AVP        24        0        0        0        0        0    0    0     1    11        0  737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

(Figure: the NLP Trinity again - Problem, Algorithm, Language axes - with HMM highlighted among the algorithms CRF, HMM, MEMM.)

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1: # of Red = 30, # of Green = 50, # of Blue = 20
Urn 2: # of Red = 10, # of Green = 40, # of Blue = 50
Urn 3: # of Red = 60, # of Green = 10, # of Blue = 30

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

Given the transition probabilities:
        U1   U2   U3
  U1   0.1  0.4  0.5
  U2   0.6  0.2  0.2
  U3   0.3  0.4  0.3

and the emission probabilities:
         R    G    B
  U1   0.3  0.5  0.2
  U2   0.1  0.4  0.5
  U3   0.6  0.1  0.3

Observation: RRGGBRGR
State sequence: ? - not so easily computable.

21 July 2014Pushpak Bhattacharyya Intro

POS 73

(Above: transition probability table and emission probability table.)
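The two tables define the urn HMM completely except for the initial (start) distribution, which the slide does not fix. The sketch below writes the model down in numpy, assumes a uniform start distribution, and samples one observation sequence from it - the generative view of the model.

```python
import numpy as np

rng = np.random.default_rng(0)
colours = ["R", "G", "B"]
A  = np.array([[0.1, 0.4, 0.5], [0.6, 0.2, 0.2], [0.3, 0.4, 0.3]])   # transition table above
B  = np.array([[0.3, 0.5, 0.2], [0.1, 0.4, 0.5], [0.6, 0.1, 0.3]])   # emission table above
pi = np.full(3, 1.0 / 3)    # start distribution: assumed uniform, not given on the slide

state, drawn = rng.choice(3, p=pi), []
for _ in range(8):
    drawn.append(colours[rng.choice(3, p=B[state])])   # draw a ball from the current urn
    state = rng.choice(3, p=A[state])                  # move to the next urn
print("".join(drawn))   # one sampled observation sequence of 8 colours
```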

Diagrammatic representation (1/2)

(Figure: state diagram over U1, U2, U3; the arcs carry the transition probabilities 0.1, 0.2, 0.3, 0.4, 0.5, 0.6 from the transition table, and each state is labelled with its emission probabilities, e.g. U1: R 0.3, G 0.5, B 0.2; U2: R 0.1, G 0.4, B 0.5; U3: R 0.6, G 0.1, B 0.3.)

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (2/2)

(Figure: the same three-state diagram with each arc annotated by the product of emission and transition probabilities, e.g. R: 0.24, G: 0.04, B: 0.12 and R: 0.06, G: 0.24, B: 0.30.)

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM
1. Given the observation sequence, find the possible state sequences - Viterbi algorithm
2. Given the observation sequence, find its probability - forward/backward algorithm
3. Given the observation sequence, find the HMM parameters - Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi
  The "start" and "end" are important in a sequence.
  Subtrees get eliminated due to the Markov Assumption.

POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), . (end) [start & end states]

Lexicon:
  people: N, V
  laugh: N, V

Corpora for Training:
  ^ w11_t11 w12_t12 w13_t13 ... w1k_1_t1k_1 .
  ^ w21_t21 w22_t22 w23_t23 ... w2k_2_t2k_2 .
  ...
  ^ wn1_tn1 wn2_tn2 wn3_tn3 ... wnk_n_tnk_n .

Inference

(Partial sequence graph: from ^ the partial paths ^-N and ^-V are grown.)

Transition probability table (row = current tag, column = next tag):
        ^     N     V     O     .
  ^     0    0.6   0.2   0.2    0
  N     0    0.1   0.4   0.3   0.2
  V     0    0.3   0.1   0.3   0.3
  O     0    0.3   0.2   0.3   0.2
  .     1     0     0     0     0

This transition table will change from language to language due to language divergences.

Lexical probability table (size = #POS tags in the tagset x vocabulary size, where vocabulary size = #unique words in the corpus):
        ε        people      laugh       ...
  ^     1          0            0
  N     0        1x10^-3      1x10^-5
  V     0        1x10^-6      1x10^-3
  O     0          0          1x10^-9
  .     1          0            0

Inference - New Sentence

^ people laugh .

p( ^ N N . | ^ people laugh . ) = (0.6 x 0.1) x (0.1 x 1 x 10^-3) x (0.2 x 1 x 10^-5)

(Figure: partial sequence graph ^ -> {N, V} -> {N, V} -> . with ε transitions at the start and end.)

Computational Complexity

If we have to get the probability of each sequence and then find the maximum among them, we would run into an exponential number of computations.

If |s| = #states (tags + ^ + .) and |o| = length of the sentence (#words + ^ + .), then #sequences = |s|^(|o|-2).

But a large number of partial computations can be reused using Dynamic Programming.

Dynamic Programming

(Figure: the tree for "^ people laugh ." - from ^ (after ε) the nodes N1, V2, O3 are created; N1 then expands on reading "people", and so on for "laugh".)

^  -> N1:  0.6 x 1.0 = 0.6
N1 -> N:   0.6 x 0.1 x 10^-3 = 6 x 10^-5
N1 -> V:   0.6 x 0.4 x 10^-3 = 2.4 x 10^-4
N1 -> O:   0.6 x 0.3 x 10^-3 = 1.8 x 10^-4
N1 -> .:   0.6 x 0.2 x 10^-3 = 1.2 x 10^-4

No need to expand N4 and N5 because they will never be a part of the winning sequence.

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.

Now complexity reduces from |s|^|o| to |s| x |o|.

Here we followed the Markov assumption of order 1.

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state.
Keep advancing sequences that are "maximum" amongst all those ending in the same state.

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm

(Figure: tree for the sentence "^ People laugh ." - after ε the children of ^ get scores N: (0.6), V: (0.2), O: (0.2); after "people" the children of N carry 0.06x10^-3, 0.24x10^-3, 0.18x10^-3, the children of V carry 0.06x10^-6, 0.02x10^-6, 0.06x10^-6, and the children of O are all (0).)

Claim: We do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass
Will a word always be given the same tag? No. Consider:

  ^ people the city with soldiers .   (i.e. 'populate')
  ^ quickly people the city .

In the first sentence "people" is most likely to be tagged as noun, whereas in the second the probability mass will shift and "people" will be tagged as verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon
  Long tail phenomenon: probability is very low but not zero over a large observed sequence.
  Language phenomenon:
    "people", which is predominantly tagged as "Noun", displays a long tail phenomenon.
    "laugh" is predominantly tagged as "Verb".

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

(Figure: two nodes N1 and N2 with accumulated probabilities 6x10^-5 and 6x10^-8, each expanding to N, V, O on reading "LAUGH".)

Next step: all the probabilities will be multiplied by identical probabilities (lexical and transition). So the children of N2 will have probability less than the children of N1.

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B) = P(B|A) if P(A) = P(B)

P(A|B) may mean:
  Causality: B causes A
  Sequentiality: A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here:
  S = {U1, U2, U3}, V = {R, G, B}
  For observation O = o1 ... on and state sequence Q = q1 ... qn,
  π is the initial state distribution: πi = P(q1 = Ui)

A (transition probabilities) =
        U1   U2   U3
  U1   0.1  0.4  0.5
  U2   0.6  0.2  0.2
  U3   0.3  0.4  0.3

B (emission probabilities) =
         R    G    B
  U1   0.3  0.5  0.2
  U2   0.1  0.4  0.5
  U3   0.6  0.1  0.3

92

Observations and states

         O1  O2  O3  O4  O5  O6  O7  O8
  Obs:    R   R   G   G   B   R   G   R
  State: S1  S2  S3  S4  S5  S6  S7  S8

Si ∈ {U1, U2, U3}: a particular state
S: state sequence; O: observation sequence
S* = "best" possible state (urn) sequence
Goal: Maximize P(S|O) by choosing the "best" S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O), where S is the State Sequence and O is the Observation Sequence:

  S* = argmax_S P(S|O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

P(S|O) = P(S1-8 | O1-8)
       = P(S1|O) P(S2|S1, O) P(S3|S2, S1, O) ... P(S8|S7 ... S1, O)

By Markov Assumption (a state depends only on the previous state):

P(S|O) = P(S1|O) P(S2|S1, O) P(S3|S2, O) ... P(S8|S7, O)

         O1  O2  O3  O4  O5  O6  O7  O8
  Obs:    R   R   G   G   B   R   G   R
  State: S1  S2  S3  S4  S5  S6  S7  S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) = P(A) P(B|A) / P(B)

P(A): Prior
P(B|A): Likelihood

argmax_S P(S|O) = argmax_S P(S) P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

P(S) = P(S1-8)
     = P(S1) P(S2|S1) P(S3|S2,S1) P(S4|S3,S2,S1) ... P(S8|S7,S6,...,S1)

By Markov Assumption (k = 1):

P(S) = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) ... P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

P(O|S) = P(O1|S1-8) P(O2|O1, S1-8) P(O3|O2,O1, S1-8) ... P(O8|O1-7, S1-8)

Assumption: the ball drawn depends only on the Urn chosen:

P(O|S) = P(O1|S1) P(O2|S2) P(O3|S3) ... P(O8|S8)

P(S|O) ∝ P(S) P(O|S)
       = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) ... P(S8|S7) x P(O1|S1) P(O2|S2) P(O3|S3) ... P(O8|S8)

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S) P(O|S) = [P(O0|S0) P(S1|S0)] [P(O1|S1) P(S2|S1)] [P(O2|S2) P(S3|S2)] [P(O3|S3) P(S4|S3)] [P(O4|S4) P(S5|S4)] [P(O5|S5) P(S6|S5)] [P(O6|S6) P(S7|S6)] [P(O7|S7) P(S8|S7)] [P(O8|S8) P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e. P(S9|S8) = 1.
O0 is the ε-transition.

         O0  O1  O2  O3  O4  O5  O6  O7  O8
  Obs:    ε   R   R   G   G   B   R   G   R
  State: S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

(Figure: the chain S0 --ε--> S1 --R--> S2 --R--> S3 --G--> S4 --G--> S5 --B--> S6 --R--> S7 --G--> S8 --R--> S9; each arc carries the observation emitted on that transition.)

         O0  O1  O2  O3  O4  O5  O6  O7  O8
  Obs:    ε   R   R   G   G   B   R   G   R
  State: S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

P(Ok|Sk) P(Sk+1|Sk) = P(Sk --Ok--> Sk+1)

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(Figure: two states S1 and S2 with output-labelled arcs. Out of S1: (a1, 0.1) to S1, (a1, 0.3) to S2, (a2, 0.2) to S1, (a2, 0.4) to S2. Out of S2: (a1, 0.2) to S1, (a1, 0.3) to S2, (a2, 0.3) to S1, (a2, 0.2) to S2 - the values used in the tree developed next.)

The question here is: "what is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree

Start: S1 = 1.0, S2 = 0.0 (accumulated probabilities)

Reading a1:
  from S1 (1.0): 1.0 x 0.1 = 0.1 (to S1), 1.0 x 0.3 = 0.3 (to S2); from S2: 0.0, 0.0

Reading a2:
  from S1 (0.1): 0.1 x 0.2 = 0.02 (to S1), 0.1 x 0.4 = 0.04 (to S2)
  from S2 (0.3): 0.3 x 0.3 = 0.09 (to S1), 0.3 x 0.2 = 0.06 (to S2)

Choose the winning sequence per state, per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd...

Winners so far: S1 = 0.09, S2 = 0.06.

Reading a1:
  from S1 (0.09): 0.09 x 0.1 = 0.009 (to S1), 0.09 x 0.3 = 0.027 (to S2)
  from S2 (0.06): 0.06 x 0.2 = 0.012 (to S1), 0.06 x 0.3 = 0.018 (to S2)
  Winners: S1 = 0.012, S2 = 0.027

Reading a2:
  from S2 (0.027): 0.027 x 0.3 = 0.0081 (to S1), 0.027 x 0.2 = 0.0054 (to S2)
  from S1 (0.012): 0.012 x 0.2 = 0.0024 (to S1), 0.012 x 0.4 = 0.0048 (to S2)

The problem being addressed by this tree is  S* = argmax_S P(S | a1-a2-a1-a2, μ),
where a1-a2-a1-a2 is the output sequence and μ the model (or the machine).

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward): the best state sequence is S1 - S2 - S1 - S2 - S1 for the output a1 a2 a1 a2.

Problem statement: Find the best possible sequence
  S* = argmax_S P(S | O, μ)
where S = state sequence, O = output sequence, μ = machine (or model).

Machine (or model) μ = (S, S0, A, T):
  S0: start symbol/state; S: state collection; A: alphabet set; T: transitions.
T is defined as P(Si --ak--> Sj).

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

Rows: ending state; columns: latest symbol observed. Every cell records the candidate probabilities ending in that state; the bold-faced (winning) value is the sequence probability ending in that state.

        ε      a1                                a2              a1                 a2
  S1   1.0    (1.0x0.1, 0.0x0.2) = (0.1, 0.0)   (0.02, 0.09)    (0.009, 0.012)     (0.0024, 0.0081)
  S2   0.0    (1.0x0.3, 0.0x0.3) = (0.3, 0.0)   (0.04, 0.06)    (0.027, 0.018)     (0.0048, 0.0054)

Final winner: going backward from the final winning probability (0.0081, the 2nd element of the last tuple, i.e. reached via S2), we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm
(following James Allen, Natural Language Understanding (2nd edition), Benjamin/Cummings (pub.), 1995)

Given:
1. The HMM, which means:
   a. Start State: S1
   b. Alphabet: A = {a1, a2, ..., ap}
   c. Set of States: S = {S1, S2, ..., Sn}
   d. Transition probability P(Si --ak--> Sj), i.e. the probability of moving from Si to Sj while emitting ak
2. The output string a1 a2 ... aT

To find:
The most likely sequence of states C1 C2 ... CT which produces the given output sequence, i.e.
  C1 C2 ... CT = argmax_C [ P(C | a1 a2 ... aT) ]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contd...

Data Structures:
1. An N x T array called SEQSCORE to maintain the winner sequence always (N = #states, T = length of the output sequence)
2. Another N x T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For i = 2 to N do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For t = 2 to T do
    For i = 1 to N do
        SEQSCORE(i,t) = Max over j = 1..N of [ SEQSCORE(j, t-1) * P(Sj --a_k--> Si) ]
        BACKPTR(i,t) = the index j that gives the Max above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = i that maximizes SEQSCORE(i,T)
For i from (T-1) to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1 x T
2. SEQSCORE can be T x 2

Homework: Compare this with A* and Beam Search. Reason for this comparison: both of them work for finding and recovering sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 109
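A sketch of the three steps above in code, run on the two-state machine developed in the preceding tree and table slides (the arc probabilities below are the ones read off that worked example). It recovers the winning sequence S1-S2-S1-S2-S1 with probability 0.0081.

```python
import numpy as np

def viterbi(T, obs):
    """Viterbi with the SEQSCORE / BACKPTR arrays described above.
    T[i, j, k] = P(S_i --a_k--> S_j); state 0 is the start state S1;
    obs is the output string as a list of symbol indices."""
    N, L = T.shape[0], len(obs)
    SEQSCORE = np.zeros((N, L + 1))
    BACKPTR = np.zeros((N, L + 1), dtype=int)
    SEQSCORE[0, 0] = 1.0                       # initialization: first state is S1
    for t in range(1, L + 1):                  # iteration
        for i in range(N):
            scores = SEQSCORE[:, t - 1] * T[:, i, obs[t - 1]]
            BACKPTR[i, t] = int(np.argmax(scores))
            SEQSCORE[i, t] = scores[BACKPTR[i, t]]
    best = int(np.argmax(SEQSCORE[:, L]))      # sequence identification
    path = [best]
    for t in range(L, 0, -1):
        path.append(BACKPTR[path[-1], t])
    return list(reversed(path)), float(SEQSCORE[:, L].max())

# Arc probabilities of the 2-state machine from the worked example (a1 = 0, a2 = 1)
T = np.zeros((2, 2, 2))
T[0, 0] = [0.1, 0.2]   # S1 -> S1 on a1, a2
T[0, 1] = [0.3, 0.4]   # S1 -> S2 on a1, a2
T[1, 0] = [0.2, 0.3]   # S2 -> S1 on a1, a2
T[1, 1] = [0.3, 0.2]   # S2 -> S2 on a1, a2
print(viterbi(T, [0, 1, 0, 1]))   # ([0, 1, 0, 1, 0], 0.0081) i.e. S1 S2 S1 S2 S1
```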

Viterbi Algorithm for the Urn problem (first two symbols)

(Figure: Viterbi computation for the urn problem over the first two symbols ε and R. S0 branches to U1, U2, U3 (arc values 0.5, 0.3, 0.2 shown); each of these expands again on reading R, with intermediate arc scores including 0.03, 0.08, 0.15, 0.06, 0.02, 0.18, 0.24 and accumulated candidate scores 0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036, of which only the per-state winners are retained.)

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of order > 1 (say 2)

The same theory works:

P(S) P(O|S) = [P(O0|S0) P(S1|S0)] [P(O1|S1) P(S2|S1,S0)] [P(O2|S2) P(S3|S2,S1)] [P(O3|S3) P(S4|S3,S2)] [P(O4|S4) P(S5|S4,S3)] [P(O5|S5) P(S6|S5,S4)] [P(O6|S6) P(S7|S6,S5)] [P(O7|S7) P(S8|S7,S6)] [P(O8|S8) P(S9|S8,S7)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e. P(S9|S8,S7) = 1.
O0 is the ε-transition.

         O0  O1  O2  O3  O4  O5  O6  O7  O8
  Obs:    ε   R   R   G   G   B   R   G   R
  State: S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence? The language modeling problem.

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   - Less probable because of grammatical mistake.
3. P("The svn rises in the east")
   - Less probable because of lexical mistake.
4. P("The sun rises in the west")
   - Less probable because of semantic mistake.

Probabilities are computed in the context of corpora.

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of a language model
1. Detect well-formedness
   - Lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   - Given a piece of text, what language does it belong to?
     Good morning - English
     Guten Morgen - German
     Bonjour - French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0, o1, o2, o3, ..., om)?

P(O) = Σ_S P(O, S)    (Marginalization)

Consider the observation sequence O = O0 O1 O2 ... Om with an underlying state sequence S = S0 S1 S2 ... Sm Sm+1, where the Si's represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0, o1, o2, o3, ..., om)

P(O, S) = P(S) P(O|S)
        = P(S0 S1 ... Sm) P(O0 O1 ... Om | S0 S1 ... Sm)
        = [P(S0) P(S1|S0) P(S2|S1) ... P(Sm|Sm-1)] [P(O0|S0) P(O1|S1) ... P(Om|Sm)]
        = P(S0) [P(O0|S0) P(S1|S0)] [P(O1|S1) P(S2|S1)] ... [P(Om-1|Sm-1) P(Sm|Sm-1)] P(Om|Sm)

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k,i)

Define F(k,i) = probability of being in state Si having seen o0 o1 o2 ... ok:
  F(k,i) = P(o0 o1 o2 ... ok, Si)

With m as the length of the observed sequence and N states,
  P(observed sequence) = P(o0 o1 o2 ... om)
                       = Σ_{p=0..N} P(o0 o1 o2 ... om, Sp)
                       = Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)

F(k, q) = P(o0 o1 o2 ... ok, Sq)
        = P(o0 o1 o2 ... ok-1, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 ... ok-1, Sp, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 ... ok-1, Sp) P(ok, Sq | o0 o1 o2 ... ok-1, Sp)
        = Σ_{p=0..N} F(k-1, p) P(ok, Sq | Sp)
        = Σ_{p=0..N} F(k-1, p) P(Sp --ok--> Sq)

  O0  O1  O2  O3  ...  Ok  Ok+1  ...  Om-1  Om
  S0  S1  S2  S3  ...  Sp  Sq    ...  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119
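The recurrence above turns directly into code. A minimal sketch for the urn model, keeping the slide's convention that the observation is emitted on the outgoing arc, i.e. P(Sp --ok--> Sq) = P(ok|Sp) P(Sq|Sp); the start distribution is an assumption (uniform) since the slides do not fix one.

```python
import numpy as np

A  = np.array([[0.1, 0.4, 0.5],      # A[p, q]  = P(next urn q | current urn p)
               [0.6, 0.2, 0.2],
               [0.3, 0.4, 0.3]])
Bm = np.array([[0.3, 0.5, 0.2],      # Bm[p, o] = P(colour o | urn p), columns R, G, B
               [0.1, 0.4, 0.5],
               [0.6, 0.1, 0.3]])
pi = np.full(3, 1.0 / 3)             # assumed uniform start distribution

def forward(obs):
    """After k steps, F[q] = P(o0..ok, state after ok = q)."""
    F = (pi * Bm[:, obs[0]]) @ A
    for o in obs[1:]:
        F = (F * Bm[:, o]) @ A       # F(k, q) = sum_p F(k-1, p) * P(Sp --ok--> Sq)
    return F.sum()                   # P(O) = sum_p F(m, p)

colours = {"R": 0, "G": 1, "B": 2}
print(forward([colours[c] for c in "RRGGBRGR"]))
```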

Backward probability B(k,i)

Define B(k,i) = probability of seeing ok ok+1 ok+2 ... om given that the state was Si:
  B(k,i) = P(ok ok+1 ok+2 ... om | Si)

With m as the length of the whole observed sequence,
  P(observed sequence) = P(o0 o1 o2 ... om)
                       = P(o0 o1 o2 ... om | S0)
                       = B(0, 0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)

B(k, p) = P(ok ok+1 ok+2 ... om | Sp)
        = P(ok, ok+1 ok+2 ... om | Sp)
        = Σ_{q=0..N} P(ok, Sq, ok+1 ok+2 ... om | Sp)
        = Σ_{q=0..N} P(ok, Sq | Sp) P(ok+1 ok+2 ... om | ok, Sq, Sp)
        = Σ_{q=0..N} P(ok+1 ok+2 ... om | Sq) P(ok, Sq | Sp)
        = Σ_{q=0..N} B(k+1, q) P(Sp --ok--> Sq)

  O0  O1  O2  O3  ...  Ok  Ok+1  ...  Om-1  Om
  S0  S1  S2  S3  ...  Sp  Sq    ...  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121
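The backward recurrence just derived can be coded the same way; this is only a sketch of the recurrence itself (the full algorithm and its complexity are left as the exercise later). Same urn model and the same assumed uniform start distribution as in the forward sketch.

```python
import numpy as np

A  = np.array([[0.1, 0.4, 0.5], [0.6, 0.2, 0.2], [0.3, 0.4, 0.3]])   # transition table
Bm = np.array([[0.3, 0.5, 0.2], [0.1, 0.4, 0.5], [0.6, 0.1, 0.3]])   # emission table (R, G, B)
pi = np.full(3, 1.0 / 3)                                             # assumed uniform start

def backward(obs):
    """Bk[p] = B(k, p) = P(ok..om | state p), via B(k,p) = sum_q B(k+1,q) P(Sp --ok--> Sq)."""
    Bk = np.ones(3)                  # base case: B(m+1, q) = 1
    for o in reversed(obs):
        Bk = Bm[:, o] * (A @ Bk)
    return Bk

colours = {"R": 0, "G": 1, "B": 2}
obs = [colours[c] for c in "RRGGBRGR"]
print(pi @ backward(obs))            # P(O); agrees with the forward computation
```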

How Forward Probability Works

Goal of Forward Probability: to find P(O), the probability of the Observation Sequence.

E.g. ^ people laugh .

(Figure: trellis with the start state ^, columns of N and V nodes for "people" and "laugh", and the end state.)

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probabilities (row = current tag, column = next tag):
        ^     N     V     .
  ^     0    0.7   0.3    0
  N     0    0.2   0.6   0.2
  V     0    0.6   0.2   0.2
  .     1     0     0     0

Lexical probabilities:
        ε     People   Laugh
  ^     1       0        0
  N     0      0.8      0.2
  V     0      0.1      0.9
  .     1       0        0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all state sequences of Π_i P(Oi | Si) P(Si --Oi--> Si+1), i.e. sum the probability of every path in the tree below.

Computation in the various paths of the tree (observations: ε, People, Laugh)

Path 1: ^ N N .    P(Path1) = (1.0 x 0.7) x (0.8 x 0.2) x (0.2 x 0.2)
Path 2: ^ N V .    P(Path2) = (1.0 x 0.7) x (0.8 x 0.6) x (0.9 x 0.2)
Path 3: ^ V N .    P(Path3) = (1.0 x 0.3) x (0.1 x 0.6) x (0.2 x 0.2)
Path 4: ^ V V .    P(Path4) = (1.0 x 0.3) x (0.1 x 0.2) x (0.9 x 0.2)

(Figure: the four-path tree rooted at ^, branching to N and V on "People" and again on "Laugh".)

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis

F = accumulated F x output probability x transition probability

F_N1 = 0.7 x 1.0
F_V1 = 0.3 x 1.0
F_N2 = F_N1 x (0.8 x 0.2) + F_V1 x (0.1 x 0.6)
F_V2 = F_N1 x (0.8 x 0.6) + F_V1 x (0.1 x 0.2)
F_.  = F_N2 x (0.2 x 0.2) + F_V2 x (0.9 x 0.2)

(F_N1, F_V1: after reading ε; F_N2, F_V2: after reading "People"; F_.: after reading "Laugh". Figure: trellis ^ -> {N, V} -> {N, V} -> . over ε, People, Laugh.)

21 July 2014Pushpak Bhattacharyya Intro

POS 125
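The two computations above - summing the four tree paths versus accumulating F on the trellis - can be checked against each other mechanically. A small sketch using the transition and lexical tables given earlier; both routes give P(O) = 0.06676 for "^ people laugh .".

```python
from itertools import product

# Transition and lexical tables from the slides above
trans = {"^": {"N": 0.7, "V": 0.3},
         "N": {"N": 0.2, "V": 0.6, ".": 0.2},
         "V": {"N": 0.6, "V": 0.2, ".": 0.2},
         ".": {}}
lex = {"^": {"eps": 1.0}, "N": {"People": 0.8, "Laugh": 0.2},
       "V": {"People": 0.1, "Laugh": 0.9}, ".": {}}

words = ["People", "Laugh"]

# (1) Inefficient computation: enumerate the four tree paths
brute = 0.0
for tags in product("NV", repeat=len(words)):
    seq, obs = ["^"] + list(tags) + ["."], ["eps"] + words
    p = 1.0
    for k in range(len(obs)):
        p *= lex[seq[k]][obs[k]] * trans[seq[k]][seq[k + 1]]
    brute += p

# (2) Trellis: F = accumulated F x output probability x transition probability
F = {"^": 1.0}
for o in ["eps"] + words:
    nxt = {}
    for p_state, f in F.items():
        for q_state, a in trans[p_state].items():
            nxt[q_state] = nxt.get(q_state, 0.0) + f * lex[p_state].get(o, 0.0) * a
    F = nxt

print(brute, F["."])   # both 0.06676
```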

Number of Multiplications

Tree:
  Each path has 5 multiplications; there are 4 paths in the tree.
  Therefore a total of 20 multiplications and 3 additions (to add up the 4 path scores).

Trellis:
  F_N1 -> 1 multiplication; F_V1 -> 1 multiplication.
  F_N2 = F_N1 x (1 mult) + F_V1 x (1 mult) = 4 multiplications + 1 addition.
  Similarly for F_V2 and F_.: 4 multiplications and 1 addition each.
  So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = #states and |O| = observation length (excluding '^' and '.').

Stage 1 of the trellis: |S| multiplications.
Stage 2 of the trellis: |S| nodes; each node needs computation over |S| arcs.
  Each arc = 1 multiplication; accumulating F = 1 more multiplication. Total: 2|S|^2 multiplications.
The same holds for each stage before reading '.'.
At the final stage ('.'): 2|S| multiplications.

Therefore total multiplications = |S| + 2|S|^2 (|O| - 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. At the final stage, take the sum of the F values multiplied by the transitions P(S --> .).
3. Complexity = |S| + 2|S|^2 (|O| - 1) + 2|S|
              = 2|S|^2 |O| - 2|S|^2 + 3|S|
              = O(|S|^2 |O|)

i.e. linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble: auto-completion of words (human vs. machine)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List
  TnT (http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf)
  Brill Tagger (http://delivery.acm.org/10.1145/1080000/1075553/p112-brill.pdf)
  Hindi POS Tagger built by IIT Bombay (http://www.cse.iitb.ac.in/~pb/papers/ACL-2006-Hindi-POS-Tagging.pdf)
  Projection (http://www.dipanjandas.com/files/posInduction.pdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Pragmatics
  Very hard problem; requires modeling user intention.

  Tourist (in a hurry, checking out of the hotel, motioning to the service boy): "Boy, go upstairs and see if my sandals are under the divan. Do not be late. I just have 15 minutes to catch the train."

  Boy (running upstairs and coming back panting): "Yes sir, they are there."

World knowledge
  WHY INDIA NEEDS A SECOND OCTOBER (ToI, 2/10/07)

21 July 2014Pushpak Bhattacharyya Intro

POS 32

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

Discourse: processing of a sequence of sentences. Mother to John:

  "John, go to school. It is open today. Should you bunk? Father will be very angry."

Ambiguity of 'open', 'bunk' what? Why will the father be angry?

Complex chain of reasoning and application of world knowledge. Ambiguity of 'father':
  father as parent, or
  father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected - today was the math test.

He couldn't control the class.

Teacher shouldn't have made him responsible.

After all, he is just a janitor.

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (1/2)

1. Teacher (angrily): "Did you miss the class yesterday?"
   Student: "Not much."

2. A man coming back to his parked car sees the sticker "Parking fine". He goes and thanks the policeman for appreciating his parking skill.

3. John: "I got a Jaguar car for my unemployed youngest son."
   Jack: "That's a great exchange!"

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (2/2)

A teacher-student exchange:
  Teacher: "What do you think is the capital of Ethiopia?"
  Student: "What do you think?"
  Teacher (angrily): "I do not think. <pause> I know."
  Student: "I do not think I know." <no pause>

21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model: Probabilistic Speech Recognition (Isolated Word) [8]

Problem Definition: Given a sequence of speech signals, identify the words.

2 steps:
  Segmentation (Word Boundary Detection)
  Identify the word

Isolated Word Recognition: identify W given SS (the speech signal):

  W* = argmax_W P(W | SS)

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood, called the "phonological model"; intuitively more tractable
P(W) = prior probability, called the "language model"

  W* = argmax_W P(W | SS)
     = argmax_W P(W) P(SS | W)

  P(W) = #(W appears in the corpus) / #(words in the corpus)

21 July 2014Pushpak Bhattacharyya Intro

POS 39
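The prior P(W) above is just a relative frequency over the corpus. A minimal sketch; the corpus here is a stand-in string, not a real training corpus.

```python
from collections import Counter

# Stand-in corpus; in practice this would be the word list of the training corpus.
corpus = "the sun rises in the east and the sun sets in the west".split()
counts = Counter(corpus)

def P(w):
    # P(W) = #(W appears in the corpus) / #(words in the corpus)
    return counts[w] / len(corpus)

print(P("sun"), P("moon"))   # 2/13 and 0.0 (unseen word)
```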

Ambiguities in the context of P(SS|W) or P(W|SS)

Concerns:
  Sound -> Text ambiguity: whether vs. weather, right vs. write, bought vs. bot
  Text -> Sound ambiguity: read (present tense) vs. read (past tense), lead (verb) vs. lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb), Advice (noun): ad-vise, ad-vice

A syllable consists of:
1. Nucleus
2. Onset
3. Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W) = P(t o m ae t o | Word is "tomato") = product of arc probabilities

(Figure: pronunciation automaton for the word "Tomato": s1 --t--> s2 --o--> s3 --m--> s4; from s4 the automaton branches with probability 0.73 to "ae" and 0.27 to "aa", and both branches continue --t--> --o--> end; all other arcs have probability 1.0.)

21 July 2014Pushpak Bhattacharyya Intro

POS 44
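The product-of-arc-probabilities computation for the automaton above is tiny; a sketch for the two pronunciations of "tomato" (only the ae/aa branch has probability different from 1.0).

```python
def p_pronunciation(phones):
    # Product of arc probabilities; only the ae/aa branch after "m" is non-deterministic.
    branch = {"ae": 0.73, "aa": 0.27}
    p = 1.0
    for ph in phones:
        p *= branch.get(ph, 1.0)
    return p

print(p_pronunciation(["t", "o", "m", "ae", "t", "o"]))   # 0.73
print(p_pronunciation(["t", "o", "m", "aa", "t", "o"]))   # 0.27
```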

Foundational question

Generative vs. Discriminative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

Entity A and Entity B - Match(A, B)?
  Two entities match iff their parts match: Match(Parts(A), Parts(B))
  Two entities match iff their properties match: Match(Properties(A), Properties(B))
Heart of discriminative vs. generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding James Allan Speech and NLP Jurafsky and Martin Foundations of Statistical NLP Manning and Schutze

Other References Statistical NLP Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied DisciplinesPhilosophy Semantics Meaning of ldquomeaningrdquo Logic

(syllogism)Linguistics Study of Syntax Lexicon Lexical Semantics etc

Probability and Statistics Corpus Linguistics Testing of Hypotheses System Evaluation

Cognitive Science Computational Models of Language Processing Language Acquisition

Psychology Behavioristic insights into Language Processing Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory Entropy Random Fields

Computer Sc amp Engg Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

infection declension derivation etc ) Rule based Vs Statistical NLU comparison with POS tagging as case study Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

ldquo_ldquo The_DT mechanisms_NNS that_WDTmake_VBP traditional_JJ hardware_NNare_VBP really_RB being_VBGobsoleted_VBN by_IN microprocessor-based_JJ machines_NNS _ rdquo_rdquo said_VBDMr_NNP Benton_NNP _

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

NLP Architecture

21 July 2014Pushpak Bhattacharyya Intro

POS 33

DiscourseProcessing of sequence of sentences Mother to John

John go to school It is open today Should you bunk Father will be very angry

Ambiguity of openbunk whatWhy will the father be angry

Complex chain of reasoning and application of world knowledge Ambiguity of father

father as parent or

father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected ndash today was the math test

He couldnrsquot control the class

Teacher shouldnrsquot have made him responsible

After all he is just a janitor

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (12)1 Teacher (angrily) did you miss the class yesterday

Student not much

2 A man coming back to his parked car sees the sticker Parking fine He goes and thanks the policeman for appreciating his parking skill

3 John I got a Jaguar car for my unemployed youngest sonJack Thats a great exchange

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (22)A teacher-student exchangeTeacher What do you think is the capital of Ethiopia

Student What do you think

Teacher (angrily) I do not think ltpausegt I know

Student I do not think I know ltno pausegt21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps Segmentation (Word Boundary Detection) Identify the word

Isolated Word Recognition Identify W given SS (speech signal)

^arg max ( | )

WW P W SS

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood called ldquophonological model ldquo intuitively more tractable

P(W) = prior probability called ldquolanguage modelrdquo

^arg m ax ( | )

arg m ax ( ) ( | )W

W

W P W SS

P W P SS W

W appears in the corpus( ) words in the corpus

P W

21 July 2014Pushpak Bhattacharyya Intro

POS 39

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound Text ambiguity whether vs weather right vs write bought vs bot

Text Sound ambiguity read (present tense) vs read (past tense) lead (verb) vs lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W) = P(t o m ae t o | Word is "tomato") = product of arc probabilities

[Figure: pronunciation automaton for the word "tomato" - states s1 ... s7 plus an end state, arcs labeled t, o, m, then a branch ae (probability 0.73) / aa (probability 0.27), then t, o; all other arc probabilities are 1.0]

21 July 2014Pushpak Bhattacharyya Intro

POS 44
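The "product of arc probabilities" is a single multiplication along each path of the automaton. A small illustrative sketch (the path encoding below is an assumption based on the figure, not code from the course):

# Each pronunciation path of "tomato" as a list of (phoneme, arc probability).
paths = {
    "ae-path": [("t", 1.0), ("o", 1.0), ("m", 1.0), ("ae", 0.73), ("t", 1.0), ("o", 1.0)],
    "aa-path": [("t", 1.0), ("o", 1.0), ("m", 1.0), ("aa", 0.27), ("t", 1.0), ("o", 1.0)],
}

for name, arcs in paths.items():
    p = 1.0
    for _phoneme, prob in arcs:
        p *= prob                      # multiply the probabilities along the path
    print(name, p)                     # 0.73 and 0.27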

Foundational question

Generative vs. Discriminative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

- Entity A and Entity B: Match(A, B)
  - Two entities match iff their parts match: Match(Parts(A), Parts(B))
  - Two entities match iff their properties match: Match(Properties(A), Properties(B))
- This is the heart of discriminative vs. generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books, Journals, Proceedings

Main Text(s): Natural Language Understanding, James Allen; Speech and Language Processing, Jurafsky and Martin; Foundations of Statistical NLP, Manning and Schutze

Other References: Statistical NLP, Charniak

Journals: Computational Linguistics, Natural Language Engineering, AI, AI Magazine, IEEE SMC

Conferences: ACL, EACL, COLING, MT Summit, EMNLP, IJCNLP, HLT, ICON, SIGIR, WWW, ICML, ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied Disciplines

Philosophy: Semantics, meaning of "meaning", Logic (syllogism)
Linguistics: Study of Syntax, Lexicon, Lexical Semantics, etc.
Probability and Statistics: Corpus Linguistics, Testing of Hypotheses, System Evaluation
Cognitive Science: Computational Models of Language Processing, Language Acquisition
Psychology: Behavioristic insights into Language Processing, Psychological Models
Brain Science: Language Processing Areas in the Brain
Physics: Information Theory, Entropy, Random Fields
Computer Sc. & Engg.: Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1: Introduction. NLP as a playground for rule based and statistical techniques.
Before break: complete NLP architecture, ambiguity, start of POS tagging.
After break: NLTK (open source Python based framework of comprehensive NLP tools), POS tagging assignment.

Day-2: Shallow parsing.
Before break: morph analysis and synthesis (segmentation, inflection, declension, derivation, etc.), rule based vs. statistical NLU comparison with POS tagging as case study, Hidden Markov Model and Viterbi algorithm.

After break: POS tagging assignment continued.

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summary
- Both Linguistics and Computation are needed
- Linguistics is the eye, Computation the body
- The cycle Phenomenon → Formalization → Technique → Experimentation → Evaluation → Hypothesis Testing has accorded to NLP the prestige it commands today
- Natural Science like approach
- Neither Theory Building nor Data Driven Pattern Finding can be ignored

21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

[Figure: the NLP Trinity - Problem axis: Morph Analysis, Part of Speech Tagging, Parsing, Semantics; Algorithm axis: HMM, MEMM, CRF; Language axis: Hindi, Marathi, English, French]

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

"_" The_DT mechanisms_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, "_" said_VBD Mr._NNP Benton_NNP ._.

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in?

Layers, in order of increasing complexity of processing:

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_J batsman_N Aravinda_De_Silva_N is_F a_F man_N of_F few_J words_N was_F very_R much_R evident_J on_F Wednesday_N when_F the_F legendary_J batsman_N ,_F who_F has_V always_R let_V his_N bat_N talk_V ,_F struggled_V to_F answer_V a_F barrage_N of_F questions_N at_F a_F function_N to_F promote_V the_F cricket_N league_N in_F the_F city_N ._F

N (noun), V (verb), J (adjective), R (adverb) and F (other, i.e. function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation: That_F/N/J ('that' can be a complementizer (put under 'F'), a demonstrative (put under 'J') or a pronoun (put under 'N')); former_J; Sri_N/J Lanka_N/J ('Sri Lanka' together qualifies the skipper); skipper_N/V ('skipper' can be a verb too); and_F; ace_J/N ('ace' can be both J and N: "Nadal served an ace"); batsman_N/J ('batsman' can be J as it qualifies Aravinda De Silva); Aravinda_N De_N Silva_N; is_F; a_F; man_N/V ('man' can be a verb too, as in 'man the boat'); of_F; few_J; words_N/V ('words' can be a verb too, as in 'he words his speeches beautifully')

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of "That"

That man is known by the company he keeps. (Demonstrative)

Man that is known by the company he keeps gets a good job. (Pronoun)

That man is known by the company he keeps, is a proverb. (Complementizer)

Chaotic systems: systems where a small perturbation in input causes a large change in output.

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation (contd.): was_F; very_R; much_R; evident_J; on_F; Wednesday_N; when_F/N ('when' can be a relative pronoun (put under 'N'), as in 'I know the time when he comes'); the_F; legendary_J; batsman_N; who_F/N; has_V; always_R; let_V; his_N; bat_N/V; talk_V/N; struggle_V/N; answer_V/N; barrage_N/V; question_N/V; function_N/V; promote_V; cricket_N; league_N; city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (1/2)

Best tag sequence
= T* = argmax P(T|W)
     = argmax P(T) P(W|T)   (by Bayes' Theorem)

P(T) = P(t0=^ t1 t2 … tn+1=.)
     = P(t0) P(t1|t0) P(t2|t1,t0) P(t3|t2,t1,t0) … P(tn|tn-1,tn-2,…,t0) P(tn+1|tn,tn-1,…,t0)
     = P(t0) P(t1|t0) P(t2|t1) … P(tn|tn-1) P(tn+1|tn)
     = ∏_{i=0}^{N+1} P(ti|ti-1)   (Bigram Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0..tn+1) P(w1|w0,t0..tn+1) P(w2|w1,w0,t0..tn+1) … P(wn|w0..wn-1,t0..tn+1) P(wn+1|w0..wn,t0..tn+1)

Assumption: a word is determined completely by its tag. This is inspired by speech recognition.

       = P(w0|t0) P(w1|t1) … P(wn+1|tn+1)
       = ∏_{i=0}^{n+1} P(wi|ti)
       = ∏_{i=1}^{n+1} P(wi|ti)   (Lexical Probability Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 66
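Under the bigram and lexical probability assumptions, scoring a candidate (W, T) pair is just two products over the sentence. A hedged Python sketch (toy probability tables; the numbers are taken from the ^ N V O example that appears later in these notes, not from a trained model):

def hmm_score(tags, words, trans, lex):
    # P(T) * P(W|T) = product of P(t_i | t_{i-1}) times product of P(w_i | t_i)
    p = 1.0
    for prev, cur in zip(tags, tags[1:]):
        p *= trans.get((prev, cur), 0.0)
    for t, w in zip(tags, words):
        p *= lex.get((t, w), 0.0)
    return p

trans = {("^", "N"): 0.6, ("N", "N"): 0.1, ("N", "."): 0.2}
lex   = {("^", "eps"): 1.0, ("N", "people"): 1e-3, ("N", "laugh"): 1e-5, (".", "eps"): 1.0}
print(hmm_score(["^", "N", "N", "."], ["eps", "people", "laugh", "eps"], trans, lex))
# 1.2e-10, the value computed for p(^ N N . | ^ people laugh .) later in the notes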

Generative Model

^_^ People_N Jump_V High_R ._.

[Figure: tag states ^, N, V, A, . connected by bigram (transition) probabilities, with the words emitted from the tags via lexical probabilities]

This model is called the Generative model. Here words are observed from tags as states. This is similar to an HMM.

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi - Unigram,

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68
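The last three steps (five fold evaluation, per POS accuracy, confusion matrix) amount to counting agreements between gold and predicted tags. A small sketch, assuming gold and pred are parallel lists of tags from one evaluation fold (names are illustrative):

from collections import Counter, defaultdict

def evaluate(gold, pred):
    # confusion[g][p] = number of tokens whose gold tag is g and predicted tag is p
    confusion = defaultdict(Counter)
    for g, p in zip(gold, pred):
        confusion[g][p] += 1
    per_pos_accuracy = {g: row[g] / sum(row.values()) for g, row in confusion.items()}
    overall_accuracy = sum(row[g] for g, row in confusion.items()) / len(gold)
    return per_pos_accuracy, confusion, overall_accuracy

# usage: acc_by_tag, cm, acc = evaluate(gold_tags, predicted_tags)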

[Bar chart: Per POS Accuracy for Bigram Assumption - accuracy (0 to 1) plotted for each individual tag of the tagset, from AJ0, AJ0-NN1, AJ0-VVG, AJC, AT0, … through VVN, VVN-VVD, VVZ-NN2]

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

            AJ0  AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN  AJC  AJS   AT0   AV0  AV0-AJ0  AVP
AJ0        2899       20       32        1        3        3    0    0    18    35       27    1
AJ0-AV0      31       18        2        0        0        0    0    0     0     1       15    0
AJ0-NN1     161        0      116        0        0        0    0    0     0     0        1    0
AJ0-VVD       7        0        0        0        0        0    0    0     0     0        0    0
AJ0-VVG       8        0        0        0        2        0    0    0     1     0        0    0
AJ0-VVN       8        0        0        3        0        2    0    0     1     0        0    0
AJC           2        0        0        0        0        0   69    0     0    11        0    0
AJS           6        0        0        0        0        0    0   38     0     2        0    0
AT0         192        0        0        0        0        0    0    0  7000    13        0    0
AV0         120        8        2        0        0        0   15    2    24  2444       29   11
AV0-AJ0      10        7        0        0        0        0    0    0     0    16       33    0
AVP          24        0        0        0        0        0    0    0     1    11        0  737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

[Figure: the NLP Trinity again - Problem: Morph Analysis, Part of Speech Tagging, Parsing, Semantics; Algorithm: HMM, MEMM, CRF; Language: Hindi, Marathi, English, French - with HMM as the algorithm in focus]

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1: # of Red = 30, # of Green = 50, # of Blue = 20

Urn 2: # of Red = 10, # of Green = 40, # of Blue = 50

Urn 3: # of Red = 60, # of Green = 10, # of Blue = 30

Colored ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd.)

Given the transition probability table:

        U1   U2   U3
U1     0.1  0.4  0.5
U2     0.6  0.2  0.2
U3     0.3  0.4  0.3

and the emission probability table:

        R    G    B
U1     0.3  0.5  0.2
U2     0.1  0.4  0.5
U3     0.6  0.1  0.3

Observation: RRGGBRGR

State Sequence: ?? (not so easily computable)

21 July 2014Pushpak Bhattacharyya Intro

POS 73

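The urn HMM above written down as data - a plain-Python sketch (numpy arrays would serve equally well):

states  = ["U1", "U2", "U3"]
symbols = ["R", "G", "B"]

# A[i][j] = P(next urn = states[j] | current urn = states[i])
A = [[0.1, 0.4, 0.5],
     [0.6, 0.2, 0.2],
     [0.3, 0.4, 0.3]]

# B[i][k] = P(ball colour = symbols[k] | urn = states[i])
B = [[0.3, 0.5, 0.2],
     [0.1, 0.4, 0.5],
     [0.6, 0.1, 0.3]]

observation = "RRGGBRGR"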

Diagrammatic representation (1/2)

[Figure: state diagram over U1, U2, U3 with the transition probabilities from the table on the arcs (e.g. U1→U1: 0.1, U1→U2: 0.4, U1→U3: 0.5, U2→U1: 0.6, …) and the emission probabilities written at each state (U1: R 0.3, G 0.5, B 0.2; U2: R 0.1, G 0.4, B 0.5; U3: R 0.6, G 0.1, B 0.3)]

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (2/2)

[Figure: the same state diagram, but with each arc now labeled by the joint probabilities of making that transition and emitting each colour, i.e. P(Ui→Uj) × P(colour | Ui), e.g. R: 0.02, G: 0.08, B: 0.10 on one arc, R: 0.24, G: 0.04, B: 0.12 on another, and so on]

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMMs:

1. Given the observation sequence, find the possible state sequences - Viterbi
2. Given the observation sequence, find its probability - forward/backward algorithm
3. Given the observation sequence, find the HMM parameters - Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi

The "start" and "end" are important in a sequence. Subtrees get eliminated due to the Markov Assumption.

POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), . (end) [start & end states]

Lexicon:
people: N, V
laugh: N, V

Corpora for Training:
^ w11_t11 w12_t12 w13_t13 …………… w1k_1_t1k_1
^ w21_t21 w22_t22 w23_t23 …………… w2k_2_t2k_2
^ wn1_tn1 wn2_tn2 wn3_tn3 …………… wnk_n_tnk_n

Inference

[Partial sequence graph: ^ branches to N and V, each of which branches again to N and V, …]

Transition Probability Table:

        ^     N     V     O     .
^       0    0.6   0.2   0.2    0
N       0    0.1   0.4   0.3   0.2
V       0    0.3   0.1   0.3   0.3
O       0    0.3   0.2   0.3   0.2
.       1     0     0     0     0

This transition table will change from language to language due to language divergences.

Lexical Probability Table (size = #POS tags in the tagset × vocabulary size, where vocabulary size = #unique words in the corpus):

        ε        people     laugh      …
^       1          0          0
N       0       1×10^-3    1×10^-5
V       0       1×10^-6    1×10^-3
O       0          0       1×10^-9
.       1          0          0

Inference on a new sentence: ^ people laugh .

p(^ N N . | ^ people laugh .) = (0.6 × 1.0) × (0.1 × 1×10^-3) × (0.2 × 1×10^-5)

(each factor pairs a transition probability with the lexical probability of the word emitted at that step)
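To make the exponential blow-up discussed next concrete, the naive approach scores every candidate tag sequence separately. A brute-force sketch using the toy tables above (illustrative code, not the course implementation):

from itertools import product

tags  = ["N", "V", "O"]
trans = {("^","N"):0.6, ("^","V"):0.2, ("^","O"):0.2,
         ("N","N"):0.1, ("N","V"):0.4, ("N","O"):0.3, ("N","."):0.2,
         ("V","N"):0.3, ("V","V"):0.1, ("V","O"):0.3, ("V","."):0.3,
         ("O","N"):0.3, ("O","V"):0.2, ("O","O"):0.3, ("O","."):0.2}
lex   = {("N","people"):1e-3, ("N","laugh"):1e-5,
         ("V","people"):1e-6, ("V","laugh"):1e-3,
         ("O","people"):0.0,  ("O","laugh"):1e-9}
words = ["people", "laugh"]

def seq_prob(seq):
    # transition x lexical probability at every step, ending with the transition into '.'
    p, prev = 1.0, "^"
    for t, w in zip(seq, words):
        p *= trans[(prev, t)] * lex[(t, w)]
        prev = t
    return p * trans[(prev, ".")]

scores = {seq: seq_prob(seq) for seq in product(tags, repeat=len(words))}
print(max(scores, key=scores.get), max(scores.values()))   # ('N', 'V'), 7.2e-08

The number of sequences enumerated here is |tags|^|words|, exactly the exponential count noted below.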

Computational Complexity

If we have to compute the probability of each sequence and then find the maximum among them, we run into an exponential number of computations.

If |s| = #states (tags + ^ + .) and |o| = length of the sentence (#words + ^ + .), then #sequences = |s|^(|o|-2).

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming

[Trellis figure for "^ people laugh .": from ^ on ε the scores into N, V, O are 0.6 × 1.0 = 0.6, 0.2 and 0.2; each surviving node then expands into N, V, O (and .) on the next word]

Expanding N1 (score 0.6) on "people" (1×10^-3 is the lexical probability of "people" under N):
  to N: 0.6 × 0.1 × 10^-3 = 6 × 10^-5
  to V: 0.6 × 0.4 × 10^-3 = 2.4 × 10^-4
  to O: 0.6 × 0.3 × 10^-3 = 1.8 × 10^-4
  to .: 0.6 × 0.2 × 10^-3 = 1.2 × 10^-4

No need to expand N4 and N5 because they will never be a part of the winning sequence.

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.

Now the complexity reduces from the exponential |s|^|o| to |s|·|o|.

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state. Keep advancing sequences that are "maximum" amongst all those ending in the same state.

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm

[Tree for the sentence "^ People laugh .": from ^ on ε, the children N, V, O get scores (0.6), (0.2), (0.2); on "People", the children of N get (0.06×10^-3), (0.24×10^-3), (0.18×10^-3), the children of V get (0.06×10^-6), (0.02×10^-6), (0.06×10^-6), and the children of O get (0), (0), (0)]

Claim: we do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass: will a word always be given the same tag? No. Consider the example:

^ people the city with soldiers . (i.e. 'populate')

^ quickly people the city .

In the first sentence "people" is most likely to be tagged as a noun, whereas in the second the probability mass shifts and "people" will be tagged as a verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and language phenomenon:

Long tail phenomenon: the probability is very low but not zero over a large observed sequence.

Language phenomenon: "people", which is predominantly tagged as "Noun", displays a long tail phenomenon; "laugh" is predominantly tagged as "Verb".

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

[Figure: two nodes N1 (score 6×10^-5) and N2 (score 6×10^-8), each expanding into N, V, O on "LAUGH"]

Next step: all the probabilities will be multiplied by identical factors (lexical and transition). So the children of N2 will have probability less than the children of N1.

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B) = P(B|A) if P(A) = P(B)

P(A|B) can mean:
Causality: B causes A
Sequentiality: A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here:
S = {U1, U2, U3}, V = {R, G, B}
For observation O = o1, … on
and state sequence Q = q1, … qn

π_i = P(q1 = Ui)   (initial state probabilities)

A (transition probabilities) =

        U1   U2   U3
U1     0.1  0.4  0.5
U2     0.6  0.2  0.2
U3     0.3  0.4  0.3

B (emission probabilities) =

        R    G    B
U1     0.3  0.5  0.2
U2     0.1  0.4  0.5
U3     0.6  0.1  0.3

92

Observations and states:

       O1  O2  O3  O4  O5  O6  O7  O8
OBS:    R   R   G   G   B   R   G   R
State: S1  S2  S3  S4  S5  S6  S7  S8

Si = U1, U2 or U3 (a particular state)
S: state sequence
O: observation sequence
S* = "best" possible state (urn) sequence
Goal: maximize P(S|O) by choosing the "best" S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

S* = argmax_S P(S|O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

P(S|O) = P(S1-8 | O1-8)
       = P(S1|O) P(S2|S1,O) P(S3|S1-2,O) … P(S8|S1-7,O)

By the Markov Assumption (a state depends only on the previous state):

P(S|O) = P(S1|O) P(S2|S1,O) P(S3|S2,O) … P(S8|S7,O)

       O1  O2  O3  O4  O5  O6  O7  O8
OBS:    R   R   G   G   B   R   G   R
State: S1  S2  S3  S4  S5  S6  S7  S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) = P(A) P(B|A) / P(B)

P(A): Prior
P(B|A): Likelihood

argmax_S P(S|O) = argmax_S P(S) P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

P(S) = P(S1-8)
     = P(S1) P(S2|S1) P(S3|S1-2) P(S4|S1-3) … P(S8|S1-7)

By the Markov Assumption (k = 1):

P(S) = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) … P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

P(O|S) = P(O1|S1-8) P(O2|O1,S1-8) P(O3|O1-2,S1-8) … P(O8|O1-7,S1-8)

Assumption: the ball drawn depends only on the urn chosen:

P(O|S) = P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)

So:

P(S|O) ∝ P(S) P(O|S)
       = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) … P(S8|S7) × P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S) P(O|S) = [P(O0|S0) P(S1|S0)] [P(O1|S1) P(S2|S1)] [P(O2|S2) P(S3|S2)] [P(O3|S3) P(S4|S3)] [P(O4|S4) P(S5|S4)] [P(O5|S5) P(S6|S5)] [P(O6|S6) P(S7|S6)] [P(O7|S7) P(S8|S7)] [P(O8|S8) P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.

After S8 the next state is S9 with probability 1, i.e. P(S9|S8) = 1.

O0 is the ε-transition.

       O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:    ε   R   R   G   G   B   R   G   R
State: S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

[Figure: chain S0 → S1 → S2 → … → S8 → S9 with the arcs labeled by the outputs ε, R, R, G, G, B, R, G, R]

       O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:    ε   R   R   G   G   B   R   G   R
State: S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

Notation: P(Ok|Sk) P(Sk+1|Sk) = P(Sk → Sk+1) with output Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

[Figure: a two-state probabilistic FSM over S1 and S2; every arc is labeled with an (output symbol, probability) pair - (a1, 0.3), (a2, 0.4), (a1, 0.2), (a2, 0.3), (a1, 0.1), (a2, 0.2), (a1, 0.3), (a2, 0.2)]

The question here is: "what is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree

Start (on ε): S1 = 1.0, S2 = 0.0

On a1:
  from S1 (1.0): to S1: 1.0 × 0.1 = 0.1    to S2: 1.0 × 0.3 = 0.3
  from S2 (0.0): to S1: 0.0                 to S2: 0.0

On a2:
  from S1 (0.1): to S1: 0.1 × 0.2 = 0.02    to S2: 0.1 × 0.4 = 0.04
  from S2 (0.3): to S1: 0.3 × 0.3 = 0.09    to S2: 0.3 × 0.2 = 0.06

Choose the winning sequence per state, per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd…

Carrying forward the winners S1 (0.09) and S2 (0.06):

On a1:
  from S1 (0.09): to S1: 0.09 × 0.1 = 0.009    to S2: 0.09 × 0.3 = 0.027
  from S2 (0.06): to S1: 0.06 × 0.2 = 0.012    to S2: 0.06 × 0.3 = 0.018
  Winners: S1: 0.012, S2: 0.027

On a2:
  from S1 (0.012): to S1: 0.012 × 0.2 = 0.0024    to S2: 0.012 × 0.4 = 0.0048
  from S2 (0.027): to S1: 0.027 × 0.3 = 0.0081    to S2: 0.027 × 0.2 = 0.0054
  Winners: S1: 0.0081, S2: 0.0054

The problem being addressed by this tree is S* = argmax_S P(S | a1-a2-a1-a2, μ), where a1-a2-a1-a2 is the output sequence and μ the model (the machine).

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward): S1 → S2 → S1 → S2 → S1, emitting a1, a2, a1, a2

Problem statement: find the best possible sequence S* = argmax_S P(S | O, μ),
where S = state sequence, O = output sequence, and μ = the machine (or model).

The machine (or model) is μ = (S, A, T, S0), where
  S  = collection of states
  A  = alphabet set
  T  = set of transitions
  S0 = start state

T is defined as P(Si --ak--> Sj), the probability of going from state Si to state Sj while emitting symbol ak.

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

Latest symbol observed →
Ending state ↓    ε      a1                               a2             a1               a2
S1               1.0    (1.0×0.1, 0.0×0.2) = (0.1, 0.0)   (0.02, 0.09)   (0.009, 0.012)   (0.0024, 0.0081)
S2               0.0    (1.0×0.3, 0.0×0.3) = (0.3, 0.0)   (0.04, 0.06)   (0.027, 0.018)   (0.0048, 0.0054)

Note: every cell records the winning probability ending in that state; the bold-faced (larger) value in each cell is the sequence probability ending in that state. Going backward from the final winner (whose winning entry is the 2nd element of the tuple, i.e. it was reached via S2) we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm (following James Allen, Natural Language Understanding (2nd edition), Benjamin Cummings (pub.), 1995)

Given:
1. The HMM, which means:
   a. Start state: S1
   b. Alphabet: A = {a1, a2, … ap}
   c. Set of states: S = {S1, S2, … Sn}
   d. Transition probability P(Si --ak--> Sj), which is equal to P(Sj, ak | Si)
2. The output string a1 a2 … aT

To find: the most likely sequence of states C1 C2 … CT which produces the given output sequence, i.e.

C1 C2 … CT = argmax_C [P(C | a1 a2 … aT, μ)]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contd…

Data structures:
1. An N×T array called SEQSCORE to maintain the winner sequence always (N = #states, T = length of the output sequence)
2. Another N×T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For i = 2 to N do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For t = 2 to T do
    For i = 1 to N do
        SEQSCORE(i,t) = Max over j = 1..N of [SEQSCORE(j,(t-1)) × P(Sj --ak--> Si)]
        BACKPTR(i,t) = the index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = the i that maximizes SEQSCORE(i,T)
For i from (T-1) to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1×T
2. SEQSCORE can be T×2

Homework: compare this with A* and Beam Search. Reason for this comparison: both of them work for finding and recovering a sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 109
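A direct transcription of the three steps into Python - a sketch, not the course's reference implementation. The packed table P[(Sj, ak, Si)] below stands for the transition probability P(Sj --ak--> Si) used above, and states[0] is assumed to be the start state S1:

def viterbi(output, states, P):
    N, T = len(states), len(output)
    SEQSCORE = [[0.0] * (T + 1) for _ in range(N)]
    BACKPTR  = [[0]   * (T + 1) for _ in range(N)]

    # 1. Initialization: all the probability mass starts in S1
    SEQSCORE[0][0] = 1.0

    # 2. Iteration
    for t in range(1, T + 1):
        a = output[t - 1]
        for i in range(N):
            best_j, best = 0, 0.0
            for j in range(N):
                cand = SEQSCORE[j][t - 1] * P.get((states[j], a, states[i]), 0.0)
                if cand > best:
                    best, best_j = cand, j
            SEQSCORE[i][t], BACKPTR[i][t] = best, best_j

    # 3. Sequence identification (backtracking)
    C = [0] * (T + 1)
    C[T] = max(range(N), key=lambda i: SEQSCORE[i][T])
    for t in range(T, 0, -1):
        C[t - 1] = BACKPTR[C[t]][t]
    return [states[i] for i in C[1:]], SEQSCORE[C[T]][T]

# For the urn HMM one would set P[(Uj, colour, Ui)] = B[Uj][colour] * A[Uj][Ui],
# i.e. emission at the source state times the transition, as in the derivation above.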

Viterbi Algorithm for the Urn problem (first two symbols)

[Trellis figure: from S0 on ε the states U1, U2, U3 are reached with probabilities 0.5, 0.3 and 0.2; on the first observation R each of them expands again into U1, U2, U3, each branch scored as (previous score) × P(R | previous urn) × P(transition), e.g. the S0→U1→U1 branch scores 0.5 × 0.3 × 0.1 = 0.015 and S0→U2→U1 scores 0.3 × 0.1 × 0.6 = 0.018; per state only the maximum is retained]

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of order > 1 (say 2)

The same theory works:

P(S) P(O|S) = [P(O0|S0) P(S1|S0)] [P(O1|S1) P(S2|S1,S0)] [P(O2|S2) P(S3|S2,S1)] [P(O3|S3) P(S4|S3,S2)] [P(O4|S4) P(S5|S4,S3)] [P(O5|S5) P(S6|S5,S4)] [P(O6|S6) P(S7|S6,S5)] [P(O7|S7) P(S8|S7,S6)] [P(O8|S8) P(S9|S8,S7)]

We introduce the states S0 and S9 as initial and final states respectively.

After S8 the next state is S9 with probability 1, i.e. P(S9|S8,S7) = 1.

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why the probability of an observation sequence? The language modeling problem:

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   - less probable because of a grammatical mistake
3. P("The svn rises in the east")
   - less probable because of a lexical mistake
4. P("The sun rises in the west")
   - less probable because of a semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of a language model:

1. Detect well-formedness
   - lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   - Given a piece of text, what language does it belong to?
     Good morning - English
     Guten Morgen - German
     Bon jour - French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0 o1 o2 o3 … om)?

P(O) = Σ_S P(O, S)   (Marginalization)

Consider the observation sequence O = o0 o1 o2 … om with state sequence S = S0 S1 S2 … Sm Sm+1, where the Si's represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0 o1 o2 o3 … om)

P(O, S) = P(S) P(O|S)

P(S0 S1 … Sm+1) P(o0 o1 … om | S0 S1 … Sm+1)
  = [P(S0) P(S1|S0) P(S2|S1) … P(Sm+1|Sm)] × [P(o0|S0) P(o1|S1) … P(om|Sm)]
  = [P(o0|S0) P(S1|S0)] [P(o1|S1) P(S2|S1)] … [P(om|Sm) P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k, i)

Define F(k, i) = probability of being in state Si having seen o0 o1 o2 … ok

F(k, i) = P(o0 o1 o2 … ok, Si)

With m as the length of the observed sequence and N states:

P(observed sequence) = P(o0 o1 o2 … om)
                     = Σ_{p=0..N} P(o0 o1 o2 … om, Sp)
                     = Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd.)

F(k, q) = P(o0 o1 o2 … ok, Sq)
        = P(o0 o1 o2 … ok-1, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 … ok-1, Sp, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 … ok-1, Sp) P(ok, Sq | o0 o1 o2 … ok-1, Sp)
        = Σ_{p=0..N} F(k-1, p) P(ok, Sq | Sp)
        = Σ_{p=0..N} F(k-1, p) P(Sp → Sq) with output ok

       O0  O1  O2  O3  …  Ok  Ok+1  …  Om-1  Om
       S0  S1  S2  S3  …  Sp  Sq    …  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119
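The recurrence F(k, q) = Σ_p F(k-1, p) · P(Sp → Sq) with output ok translates almost line for line into code. A sketch using the same packed probability table P[(Sp, o, Sq)] as in the Viterbi sketch earlier (an assumed encoding, with the emission attached to the source state as in the derivation):

def forward(obs, states, P, start):
    # F[k][q] = P(first k outputs seen, currently in state q); F[0] is before any output
    F = [{q: (1.0 if q == start else 0.0) for q in states}]
    for o in obs:
        prev = F[-1]
        F.append({q: sum(prev[p] * P.get((p, o, q), 0.0) for p in states)
                  for q in states})
    return F   # P(obs) = sum(F[-1].values()), i.e. the sum of F(m, p) over states p

# usage sketch: F = forward(list("RRGGBRGR"), states, P, "U1"), with P built from A and B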

Backward probability B(k, i)

Define B(k, i) = probability of seeing ok ok+1 ok+2 … om given that the state was Si

B(k, i) = P(ok ok+1 ok+2 … om | Si)

With m as the length of the whole observed sequence:

P(observed sequence) = P(o0 o1 o2 … om)
                     = P(o0 o1 o2 … om | S0)
                     = B(0, 0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd.)

B(k, p) = P(ok ok+1 ok+2 … om | Sp)
        = P(ok, ok+1 ok+2 … om | Sp)
        = Σ_{q=0..N} P(ok, Sq, ok+1 ok+2 … om | Sp)
        = Σ_{q=0..N} P(ok, Sq | Sp) P(ok+1 ok+2 … om | ok, Sq, Sp)
        = Σ_{q=0..N} P(ok+1 ok+2 … om | Sq) P(ok, Sq | Sp)
        = Σ_{q=0..N} B(k+1, q) P(Sp → Sq) with output ok

       O0  O1  O2  O3  …  Ok  Ok+1  …  Om-1  Om
       S0  S1  S2  S3  …  Sp  Sq    …  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121
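For checking a derivation of the backward algorithm (it is set as an exercise at the end of this part), here is one possible realization under the same assumed table encoding as the forward sketch - not the official solution:

def backward(obs, states, P):
    # B[k][p] = P(obs[k:] | state p); the base case B[m+1][.] = 1
    B = [{p: 1.0 for p in states}]
    for o in reversed(obs):
        nxt = B[0]
        B.insert(0, {p: sum(P.get((p, o, q), 0.0) * nxt[q] for q in states)
                     for p in states})
    return B   # P(obs) = B[0][start state], matching B(0, 0) in the notes

Combining the two quantities, P(O) can also be written as Σ_p F(k, p) · B(k+1, p) for any cut point k, which is one way to approach the second part of the exercise.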

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

E.g., ^ People laugh .

[Trellis: ^ → {N, V} → {N, V} → .]

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probabilities:

        ^    N    V    .
^       0   0.7  0.3   0
N       0   0.2  0.6  0.2
V       0   0.6  0.2  0.2
.       1    0    0    0

Lexical (emission) probabilities:

        ε   People  Laugh
^       1     0       0
N       0    0.8     0.2
V       0    0.1     0.9
.       1     0       0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ_S ∏_i P(Oi | Si) P(Si → Si+1)

(a sum over all state sequences S - exponentially many terms)

Computation in the various paths of the Tree (ε, People, Laugh):

Path 1: ^ N N
P(Path1) = (1.0×0.7) × (0.8×0.2) × (0.2×0.2)

Path 2: ^ N V
P(Path2) = (1.0×0.7) × (0.8×0.6) × (0.9×0.2)

Path 3: ^ V N
P(Path3) = (1.0×0.3) × (0.1×0.6) × (0.2×0.2)

Path 4: ^ V V
P(Path4) = (1.0×0.3) × (0.1×0.2) × (0.9×0.2)

[Figure: tree rooted at ^, branching to N and V on "People", each branching again to N and V on "Laugh"]

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis

F = accumulated F × output probability × transition probability

F1(N) = 0.7 × 1.0
F1(V) = 0.3 × 1.0
F2(N) = F1(N) × (0.2 × 0.8) + F1(V) × (0.6 × 0.1)
F2(V) = F1(N) × (0.6 × 0.8) + F1(V) × (0.2 × 0.1)
F3(.) = F2(N) × (0.2 × 0.2) + F2(V) × (0.2 × 0.9)

[Trellis: ^ on ε → {N, V} on "People" → {N, V} on "Laugh" → .]

21 July 2014Pushpak Bhattacharyya Intro

POS 125
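A quick numeric check that the trellis computation gives the same P(O) as summing the four tree paths (plain Python, toy numbers from the tables above):

# the four tree paths for "^ People laugh ."
paths = [
    (1.0*0.7) * (0.8*0.2) * (0.2*0.2),   # ^ N N
    (1.0*0.7) * (0.8*0.6) * (0.9*0.2),   # ^ N V
    (1.0*0.3) * (0.1*0.6) * (0.2*0.2),   # ^ V N
    (1.0*0.3) * (0.1*0.2) * (0.9*0.2),   # ^ V V
]

# the trellis (forward) computation
F1_N, F1_V = 0.7*1.0, 0.3*1.0
F2_N = F1_N*(0.2*0.8) + F1_V*(0.6*0.1)
F2_V = F1_N*(0.6*0.8) + F1_V*(0.2*0.1)
F3   = F2_N*(0.2*0.2) + F2_V*(0.2*0.9)

print(sum(paths), F3)   # both come out to 0.06676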

Number of Multiplications

Tree: each path has 5 multiplications; there are 4 paths in the tree, and 3 additions to sum them. Therefore a total of 20 multiplications and 3 additions.

Trellis: F1(N) → 1 multiplication; F1(V) → 1 multiplication; F2(N) = F1(N) × (1 mult) + F1(V) × (1 mult) = 4 multiplications + 1 addition. Similarly for F2(V) and F3(.): 4 multiplications and 1 addition each. So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = #states and |O| = observation length (the number of symbols, leaving out ^ and .).

Stage 1 of the trellis: |S| multiplications.
Stage 2 of the trellis: |S| nodes; each node needs computation over |S| arcs; each arc = 1 multiplication; the accumulated F = 1 more multiplication; total 2|S|² multiplications. The same holds for each stage before reading '.'.
At the final stage ('.'): 2|S| multiplications.

Therefore, total multiplications = |S| + 2|S|²(|O| - 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values multiplied by P(Sp → Sq).
3. Complexity = |S| + 2|S|²(|O| - 1) + 2|S|
              = 2|S|²|O| - 2|S|² + 3|S|
              = O(|S|² |O|)

i.e. linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward Probability
   a) Derive the Backward Algorithm
   b) Compute its complexity
2. Express P(O) in terms of both the Forward and the Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble: auto-completion of words (human vs. machine)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List

- TnT (http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf)
- Brill Tagger (http://delivery.acm.org/10.1145/1080000/1075553/p112-brill.pdf)
- Hindi POS Tagger built by IIT Bombay (http://www.cse.iitb.ac.in/~pb/papers/ACL-2006-Hindi-POS-Tagging.pdf)
- Projection (http://www.dipanjandas.com/files/posInduction.pdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

DiscourseProcessing of sequence of sentences Mother to John

John go to school It is open today Should you bunk Father will be very angry

Ambiguity of openbunk whatWhy will the father be angry

Complex chain of reasoning and application of world knowledge Ambiguity of father

father as parent or

father as headmaster

21 July 2014Pushpak Bhattacharyya Intro

POS 34

Complexity of Connected Text

John was returning from school dejected ndash today was the math test

He couldnrsquot control the class

Teacher shouldnrsquot have made him responsible

After all he is just a janitor

21 July 2014Pushpak Bhattacharyya Intro

POS 35

Textual Humour (12)1 Teacher (angrily) did you miss the class yesterday

Student not much

2 A man coming back to his parked car sees the sticker Parking fine He goes and thanks the policeman for appreciating his parking skill

3 John I got a Jaguar car for my unemployed youngest sonJack Thats a great exchange

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (22)A teacher-student exchangeTeacher What do you think is the capital of Ethiopia

Student What do you think

Teacher (angrily) I do not think ltpausegt I know

Student I do not think I know ltno pausegt21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps Segmentation (Word Boundary Detection) Identify the word

Isolated Word Recognition Identify W given SS (speech signal)

^arg max ( | )

WW P W SS

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood called ldquophonological model ldquo intuitively more tractable

P(W) = prior probability called ldquolanguage modelrdquo

^arg m ax ( | )

arg m ax ( ) ( | )W

W

W P W SS

P W P SS W

W appears in the corpus( ) words in the corpus

P W

21 July 2014Pushpak Bhattacharyya Intro

POS 39

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound Text ambiguity whether vs weather right vs write bought vs bot

Text Sound ambiguity read (present tense) vs read (past tense) lead (verb) vs lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W)= P(t o m ae t o |Word is ldquotomatordquo) = Product of arc probabilities

t

s4

o m o

ae

t

aa

end

s1 s2 s3s5

s6 s7

10 10 10 1010

10

073

027

Word

Pronunciation Automaton

Tomato

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs Discrimnative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

bull Entity A and Entity BndashMatch(AB)

ndashTwo entities match iff their parts matchbull Match(Parts(A) Parts(B))

ndashTwo entities match iff their properties matchbull Match(Properties(A) Properties(B))

bull Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding James Allan Speech and NLP Jurafsky and Martin Foundations of Statistical NLP Manning and Schutze

Other References Statistical NLP Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied DisciplinesPhilosophy Semantics Meaning of ldquomeaningrdquo Logic

(syllogism)Linguistics Study of Syntax Lexicon Lexical Semantics etc

Probability and Statistics Corpus Linguistics Testing of Hypotheses System Evaluation

Cognitive Science Computational Models of Language Processing Language Acquisition

Psychology Behavioristic insights into Language Processing Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory Entropy Random Fields

Computer Sc amp Engg Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

infection declension derivation etc ) Rule based Vs Statistical NLU comparison with POS tagging as case study Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

ldquo_ldquo The_DT mechanisms_NNS that_WDTmake_VBP traditional_JJ hardware_NNare_VBP really_RB being_VBGobsoleted_VBN by_IN microprocessor-based_JJ machines_NNS _ rdquo_rdquo said_VBDMr_NNP Benton_NNP _

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List

TnT (http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (http://www.cse.iitb.ac.in/~pb/papers/ACL-2006-Hindi-POS-Tagging.pdf)

Projection (http://www.dipanjandas.com/files/posInduction.pdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131



Textual Humour (12)1 Teacher (angrily) did you miss the class yesterday

Student not much

2 A man coming back to his parked car sees the sticker Parking fine He goes and thanks the policeman for appreciating his parking skill

3 John I got a Jaguar car for my unemployed youngest sonJack Thats a great exchange

21 July 2014Pushpak Bhattacharyya Intro

POS 36

Textual Humour (22)A teacher-student exchangeTeacher What do you think is the capital of Ethiopia

Student What do you think

Teacher (angrily) I do not think ltpausegt I know

Student I do not think I know ltno pausegt21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps Segmentation (Word Boundary Detection) Identify the word

Isolated Word Recognition Identify W given SS (speech signal)

^arg max ( | )

WW P W SS

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood called ldquophonological model ldquo intuitively more tractable

P(W) = prior probability called ldquolanguage modelrdquo

^arg m ax ( | )

arg m ax ( ) ( | )W

W

W P W SS

P W P SS W

W appears in the corpus( ) words in the corpus

P W

21 July 2014Pushpak Bhattacharyya Intro

POS 39

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound Text ambiguity whether vs weather right vs write bought vs bot

Text Sound ambiguity read (present tense) vs read (past tense) lead (verb) vs lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W)= P(t o m ae t o |Word is ldquotomatordquo) = Product of arc probabilities

t

s4

o m o

ae

t

aa

end

s1 s2 s3s5

s6 s7

10 10 10 1010

10

073

027

Word

Pronunciation Automaton

Tomato

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs Discrimnative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

bull Entity A and Entity BndashMatch(AB)

ndashTwo entities match iff their parts matchbull Match(Parts(A) Parts(B))

ndashTwo entities match iff their properties matchbull Match(Properties(A) Properties(B))

bull Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding James Allan Speech and NLP Jurafsky and Martin Foundations of Statistical NLP Manning and Schutze

Other References Statistical NLP Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied DisciplinesPhilosophy Semantics Meaning of ldquomeaningrdquo Logic

(syllogism)Linguistics Study of Syntax Lexicon Lexical Semantics etc

Probability and Statistics Corpus Linguistics Testing of Hypotheses System Evaluation

Cognitive Science Computational Models of Language Processing Language Acquisition

Psychology Behavioristic insights into Language Processing Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory Entropy Random Fields

Computer Sc amp Engg Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

infection declension derivation etc ) Rule based Vs Statistical NLU comparison with POS tagging as case study Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

ldquo_ldquo The_DT mechanisms_NNS that_WDTmake_VBP traditional_JJ hardware_NNare_VBP really_RB being_VBGobsoleted_VBN by_IN microprocessor-based_JJ machines_NNS _ rdquo_rdquo said_VBDMr_NNP Benton_NNP _

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis

F = accumulated F x output probability x transition probability

F(^)   = 1
F1(N)  = 0.7 x 1.0 = 0.7
F1(V)  = 0.3 x 1.0 = 0.3
F2(N)  = F1(N) x (0.2 x 0.8) + F1(V) x (0.6 x 0.1)
F2(V)  = F1(N) x (0.6 x 0.8) + F1(V) x (0.2 x 0.1)
F(.)   = F2(N) x (0.2 x 0.2) + F2(V) x (0.2 x 0.9)

(each factor pairs the transition probability with the lexical probability of the word read on that arc)

[Trellis: ^ (ε) → {N, V} (People) → {N, V} (Laugh) → .]

21 July 2014Pushpak Bhattacharyya Intro

POS 125
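A quick numeric check of these trellis values (a sketch using the tables above; the end-of-sentence arc is included exactly as in the path computations):

# tables from the slides above
trans = {("^", "N"): 0.7, ("^", "V"): 0.3, ("N", "N"): 0.2, ("N", "V"): 0.6, ("N", "."): 0.2,
         ("V", "N"): 0.6, ("V", "V"): 0.2, ("V", "."): 0.2}
emit = {("N", "people"): 0.8, ("N", "laugh"): 0.2, ("V", "people"): 0.1, ("V", "laugh"): 0.9}

F1 = {q: 1.0 * trans[("^", q)] for q in ("N", "V")}               # F1(N) = 0.7, F1(V) = 0.3
F2 = {q: sum(F1[p] * emit[(p, "people")] * trans[(p, q)] for p in ("N", "V"))
      for q in ("N", "V")}                                        # F2(N) = 0.130, F2(V) = 0.342
F_end = sum(F2[p] * emit[(p, "laugh")] * trans[(p, ".")] for p in ("N", "V"))
print(F1, F2, F_end)   # F_end = 0.06676 = P(Path1) + P(Path2) + P(Path3) + P(Path4)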

Number of Multiplications

Tree: each path has 5 multiplications; there are 4 paths in the tree, and combining the 4 path scores needs 3 additions. Therefore a total of 20 multiplications and 3 additions.

Trellis: F1(N) -> 1 multiplication; F1(V) -> 1 multiplication;
F2(N) = F1(N) x (1 mult) + F1(V) x (1 mult) = 4 multiplications + 1 addition.
Similarly for F2(V) and F(.): 4 multiplications and 1 addition each.

So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = number of states and |O| = observation length, not counting the '^' and '.' symbols.

• Stage 1 of the trellis: |S| multiplications
• Stage 2 of the trellis: |S| nodes, each node needs computation over |S| arcs
  - Each arc = 1 multiplication
  - Accumulated F = 1 more multiplication
  - Total: 2|S|^2 multiplications
• Same for each stage before reading '.'
• At the final stage ('.'): 2|S| multiplications

Therefore total multiplications = |S| + 2|S|^2 (|O| - 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1. Accumulate F over each stage of the trellis
2. Take the sum of the F values multiplied by P(Sp → Sq)
3. Complexity = |S| + 2|S|^2 (|O| - 1) + 2|S|
             = 2|S|^2 |O| - 2|S|^2 + 3|S|
             = O(|S|^2 |O|)

i.e., linear in the length of the input and quadratic in the number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward Probability
   a) Derive the Backward Algorithm
   b) Compute its complexity
2. Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List

• TnT (http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf)
• Brill Tagger (http://delivery.acm.org/10.1145/1080000/1075553/p112-brill.pdf)
• Hindi POS Tagger built by IIT Bombay (http://www.cse.iitb.ac.in/~pb/papers/ACL-2006-Hindi-POS-Tagging.pdf)
• Projection (http://www.dipanjandas.com/files/posInduction.pdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Textual Humour (2/2)

A teacher-student exchange:
Teacher: What do you think is the capital of Ethiopia?
Student: What do you think?
Teacher: (angrily) I do not think <pause> I know.
Student: I do not think I know. <no pause>

21 July 2014

Pushpak Bhattacharyya Intro POS 37

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition: Given a sequence of speech signals, identify the words.

Two steps:
1. Segmentation (Word Boundary Detection)
2. Identify the word

Isolated Word Recognition: Identify W given SS (speech signal)

  Ŵ = argmax_W P(W | SS)

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood, called the "phonological model"; intuitively more tractable

P(W) = prior probability, called the "language model"

  Ŵ = argmax_W P(W | SS)
    = argmax_W P(W) P(SS | W)

  P(W) = #(W appears in the corpus) / #(words in the corpus)

21 July 2014Pushpak Bhattacharyya Intro

POS 39
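A small sketch of the corpus estimate of the prior P(W) above (the token list is an assumed toy corpus):

from collections import Counter

tokens = "the cat sat on the mat near the door".split()   # assumed toy corpus
counts = Counter(tokens)

def p_word(w):
    # P(W) = #(W appears in the corpus) / #(words in the corpus)
    return counts[w] / len(tokens)

print(p_word("the"))   # 3/9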

Ambiguities in the context of P(SS|W) or P(W|SS)

Concerns:
• Sound → Text ambiguity: whether vs. weather, right vs. write, bought vs. bot
• Text → Sound ambiguity: read (present tense) vs. read (past tense), lead (verb) vs. lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

• Phonemes (sound)
• Syllables
• ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

  /t/  sound of t in tag
  /d/  sound of d in dog
  /D/  sound of th in 'the'

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb), Advice (noun)
  ad-vise
  ad-vice

• A syllable consists of:
  1. Nucleus
  2. Onset
  3. Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W) = P(t o m ae t o | Word is "tomato") = product of arc probabilities

[Pronunciation automaton for the word "Tomato": states s1 → s2 → s3 → {s4 via 'ae' with probability 0.73, or s5 via 'aa' with probability 0.27} → s6 → s7 → end, emitting t, o, m, ae/aa, t, o; all other arcs have probability 1.0]

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs. Discriminative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

• Entity A and Entity B
  - Match(A, B)
  - Two entities match iff their parts match
    • Match(Parts(A), Parts(B))
  - Two entities match iff their properties match
    • Match(Properties(A), Properties(B))
• Heart of discriminative vs. generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books, Journals, Proceedings

Main Text(s):
• Natural Language Understanding, James Allen
• Speech and NLP, Jurafsky and Martin
• Foundations of Statistical NLP, Manning and Schutze

Other References:
• Statistical NLP, Charniak

Journals: Computational Linguistics, Natural Language Engineering, AI, AI Magazine, IEEE SMC

Conferences: ACL, EACL, COLING, MT Summit, EMNLP, IJCNLP, HLT, ICON, SIGIR, WWW, ICML, ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied Disciplines

Philosophy                   Semantics, Meaning of "meaning", Logic (syllogism)
Linguistics                  Study of Syntax, Lexicon, Lexical Semantics etc.
Probability and Statistics   Corpus Linguistics, Testing of Hypotheses, System Evaluation
Cognitive Science            Computational Models of Language Processing, Language Acquisition
Psychology                   Behavioristic insights into Language Processing, Psychological Models
Brain Science                Language Processing Areas in Brain
Physics                      Information Theory, Entropy, Random Fields
Computer Sc. & Engg.         Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

inflection, declension, derivation etc.), Rule based vs. Statistical NLU comparison with POS tagging as case study, Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summary
• Both Linguistics and Computation needed
• Linguistics is the eye, Computation the body
• The Phenomenon → Formalization → Technique → Experimentation → Evaluation → Hypothesis Testing cycle has accorded to NLP the prestige it commands today
• Natural Science like approach
• Neither Theory Building nor Data Driven Pattern finding can be ignored
21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

[NLP Trinity diagram: Problem axis (Morph Analysis, Part of Speech Tagging, Parsing, Semantics), Algorithm axis (HMM, MEMM, CRF), Language axis (Hindi, Marathi, English, French)]

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

"_" The_DT mechanisms_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, "_" said_VBD Mr._NNP Benton_NNP ._.

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_J batsman_N Aravinda_De_Silva_N is_F a_F man_N of_F few_J words_N was_F very_R much_R evident_J on_F Wednesday_N when_F the_F legendary_J batsman_N ,_F who_F has_V always_R let_V his_N bat_N talk_V ,_F struggled_V to_F answer_V a_F barrage_N of_F questions_N at_F a_F function_N to_F promote_V the_F cricket_N league_N in_F the_F city_N ._F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation
• That_FNJ ('that' can be a complementizer (put under 'F'), a demonstrative (put under 'J') or a pronoun (put under 'N'))
• former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper)
• skipper_NV ('skipper' can be a verb too)
• and_F ace_JN ('ace' can be both J and N: "Nadal served an ace")
• batsman_NJ ('batsman' can be J as it qualifies Aravinda De Silva)
• Aravinda_N De_N Silva_N is_F a_F man_NV ('man' can be a verb too, as in 'man the boat')
• of_F few_J words_NV ('words' can be a verb too, as in 'he words his speeches beautifully')

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of "That"

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation (contd.)
• was_F very_R much_R evident_J on_F Wednesday_N
• when_FN ('when' can be a relative pronoun (put under 'N') as in 'I know the time when he comes')
• the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN
• struggle_VN answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (1/2)

Best tag sequence
  T* = argmax P(T|W)
     = argmax P(T) P(W|T)      (by Bayes' Theorem)

P(T) = P(t0 = ^, t1, t2, ..., tn+1 = .)
     = P(t0) P(t1|t0) P(t2|t1, t0) P(t3|t2, t1, t0) ... P(tn|tn-1, ..., t0) P(tn+1|tn, ..., t0)
     = P(t0) P(t1|t0) P(t2|t1) ... P(tn|tn-1) P(tn+1|tn)
     = Π (i = 0 to n+1) P(ti|ti-1)      (Bigram Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (2/2)

P(W|T) = P(w0|t0..tn+1) P(w1|w0, t0..tn+1) P(w2|w1, w0, t0..tn+1) ... P(wn|w0..wn-1, t0..tn+1) P(wn+1|w0..wn, t0..tn+1)

Assumption: A word is determined completely by its tag. This is inspired by speech recognition.

P(W|T) = P(w0|t0) P(w1|t1) ... P(wn+1|tn+1)
       = Π (i = 0 to n+1) P(wi|ti)
       = Π (i = 1 to n+1) P(wi|ti)      (Lexical Probability Assumption; the i = 0 term can be dropped since w0 = '^' is fixed given t0 = '^')

21 July 2014Pushpak Bhattacharyya Intro

POS 66
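Putting the bigram and lexical probability assumptions together, the following sketch (assumed toy tables, not the lecture's corpus estimates) scores one candidate tag sequence T for a word sequence W as P(T)P(W|T):

# assumed toy tables: P(t_i | t_{i-1}) and P(w_i | t_i)
trans = {("^", "N"): 0.6, ("N", "V"): 0.4, ("V", "."): 0.3}
lex = {("N", "people"): 1e-3, ("V", "laugh"): 1e-3, (".", "."): 1.0}

def score(words, tags):
    # words and tags include the boundary symbols ^ and .
    p = 1.0
    for i in range(1, len(tags)):
        p *= trans.get((tags[i - 1], tags[i]), 0.0)   # bigram assumption: P(t_i | t_{i-1})
        p *= lex.get((tags[i], words[i]), 0.0)        # lexical assumption: P(w_i | t_i)
    return p

print(score(["^", "people", "laugh", "."], ["^", "N", "V", "."]))

The tagger's job is then to search over all tag sequences for the T that maximizes this score, which is what the Viterbi algorithm below does efficiently.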

Generative Model

^_^ People_N Jump_V High_R ._.

[Lattice: each word can also be reached from alternative tags (People: N or V, Jump: V or N, High: A or N), with bigram probabilities between adjacent tags and lexical probabilities from tag to word]

This model is called the Generative model: words are observed from tags as states. This is similar to an HMM.

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

[Bar chart: Per POS Accuracy for Bigram Assumption; accuracy (y-axis 0 to 1.2) for each tag of the tagset: AJ0, AJ0-NN1, AJ0-VVG, AJC, AT0, AV0-AJ0, AVP-PRP, AVQ-CJS, CJS, CJS-PRP, CJT-DT0, CRD-PNI, DT0, DTQ, ITJ, NN1, NN1-NP0, NN1-VVG, NN2-VVZ, NP0-NN1, PNI, PNP, PNX, PRP, PRP-CJS, TO0, VBB, VBG, VBN, VDB, VDG, VDN, VHB, VHG, VHN, VM0, VVB-NN1, VVD-AJ0, VVG, VVG-NN1, VVN, VVN-VVD, VVZ-NN2]

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

            AJ0  AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN  AJC  AJS   AT0   AV0  AV0-AJ0  AVP
AJ0        2899       20       32        1        3        3    0    0    18    35       27    1
AJ0-AV0      31       18        2        0        0        0    0    0     0     1       15    0
AJ0-NN1     161        0      116        0        0        0    0    0     0     0        1    0
AJ0-VVD       7        0        0        0        0        0    0    0     0     0        0    0
AJ0-VVG       8        0        0        0        2        0    0    0     1     0        0    0
AJ0-VVN       8        0        0        3        0        2    0    0     1     0        0    0
AJC           2        0        0        0        0        0   69    0     0    11        0    0
AJS           6        0        0        0        0        0    0   38     0     2        0    0
AT0         192        0        0        0        0        0    0    0  7000    13        0    0
AV0         120        8        2        0        0        0   15    2    24  2444       29   11
AV0-AJ0      10        7        0        0        0        0    0    0     0    16       33    0
AVP          24        0        0        0        0        0    0    0     1    11        0  737

21 July 2014Pushpak Bhattacharyya Intro

POS 70
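For the evaluation steps listed earlier (per-POS accuracy and the confusion matrix), a minimal sketch of how such a matrix is accumulated from gold and predicted tags (the two tag lists are assumed toy inputs):

from collections import defaultdict

gold = ["N", "V", "N", "N", "V"]     # assumed gold tags
pred = ["N", "N", "N", "V", "V"]     # assumed predicted tags

confusion = defaultdict(lambda: defaultdict(int))
for g, p in zip(gold, pred):
    confusion[g][p] += 1             # row = gold tag, column = predicted tag

for tag in confusion:
    correct = confusion[tag][tag]
    total = sum(confusion[tag].values())
    print(tag, "per-POS accuracy:", correct / total)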

HMM

[NLP Trinity diagram: Problem (Morph Analysis, Part of Speech Tagging, Parsing, Semantics), Algorithm (HMM, MEMM, CRF), Language (Hindi, Marathi, English, French); HMM highlighted]

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1: # of Red = 30, # of Green = 50, # of Blue = 20
Urn 2: # of Red = 10, # of Green = 40, # of Blue = 50
Urn 3: # of Red = 60, # of Green = 10, # of Blue = 30

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

Given the transition probability table:
        U1    U2    U3
  U1    0.1   0.4   0.5
  U2    0.6   0.2   0.2
  U3    0.3   0.4   0.3

and the emission probability table:
        R     G     B
  U1    0.3   0.5   0.2
  U2    0.1   0.4   0.5
  U3    0.6   0.1   0.3

Observation: R R G G B R G R

State Sequence: ?? Not so easily computable.

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Transition probability table and emission probability table

Diagrammatic representation (1/2)

[State-transition diagram: states U1, U2, U3 with the transition probabilities of the table above on the arcs; each state emits R/G/B with its emission probabilities (U1: R 0.3, G 0.5, B 0.2; U2: R 0.1, G 0.4, B 0.5; U3: R 0.6, G 0.1, B 0.3)]

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (2/2)

[Same diagram with the combined transition x emission probability for each colour written on every arc, e.g. R 0.02, G 0.08, B 0.10 on one arc, R 0.24, G 0.04, B 0.12 on another, and so on]

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM
1. Given the observation sequence, find the possible state sequences - Viterbi
2. Given the observation sequence, find its probability - forward/backward algorithm
3. Given the observation sequence, find the HMM parameters - Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi
• The "start" and "end" are important in a sequence.
• Subtrees get eliminated due to the Markov Assumption.

POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), . (end) [start & end states]

Illustration of Viterbi: Lexicon
  people: N, V
  laugh:  N, V

Corpora for Training
  ^ w11_t11 w12_t12 w13_t13 ... w1k_1_t1k_1 .
  ^ w21_t21 w22_t22 w23_t23 ... w2k_2_t2k_2 .
  ...
  ^ wn1_tn1 wn2_tn2 wn3_tn3 ... wnk_n_tnk_n .

Inference

[Partial sequence graph: ^ → N, then N → N and N → V, ...]

Transition Probability Table:
        ^     N     V     O     .
  ^     0     0.6   0.2   0.2   0
  N     0     0.1   0.4   0.3   0.2
  V     0     0.3   0.1   0.3   0.3
  O     0     0.3   0.2   0.3   0.2
  .     1     0     0     0     0

This transition table will change from language to language due to language divergences.

Lexical Probability Table:
Size of this table = #(POS tags in tagset) x vocabulary size, where vocabulary size = #(unique words in the corpus)

        ε     people     laugh      ...
  ^     1     0          0          0
  N     0     1x10^-3    1x10^-5
  V     0     1x10^-6    1x10^-3
  O     0     0          1x10^-9
  .     1     0          0          0

Inference: New Sentence

^ people laugh .

p( ^ N N . | ^ people laugh . ) = (0.6 x 1.0) x (0.1 x 1x10^-3) x (0.2 x 1x10^-5)

[Partial graph: ^ reading ε, then N/V alternatives for "people" and "laugh"]

Computational Complexity

If we have to get the probability of each sequence and then find the maximum among them, we would run into an exponential number of computations.

If |s| = #states (tags + ^ + .) and |o| = length of sentence (#words + ^ + .), then #sequences = |s|^(|o|-2)

But a large number of partial computations can be reused using Dynamic Programming.

Dynamic Programming

[Tree rooted at ^ (reading ε) with children N1, V2, O3; on "people" each of these expands to N, V, O, . children; the N-children of V2 and O3 are marked N4 and N5]

^ → N1: 0.6 x 1.0 = 0.6      (^ → V2: 0.2, ^ → O3: 0.2)

Expanding N1 on "people":
  0.6 x 0.1 x 10^-3 = 6 x 10^-5
  1. 0.6 x 0.4 x 10^-3 = 2.4 x 10^-4
  2. 0.6 x 0.3 x 10^-3 = 1.8 x 10^-4
  3. 0.6 x 0.2 x 10^-3 = 1.2 x 10^-4

No need to expand N4 and N5, because they will never be a part of the winning sequence.

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.

Now complexity reduces from |s|^|o| to |s|.|o|.

Here we followed the Markov assumption of order 1.

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm

Tree for the sentence "^ People laugh ."

[Tree: ^ (reading ε) expands to N (0.6), V (0.2), O (0.2); on "People" each of these expands to N, V, O children, with scores (0.06x10^-3), (0.24x10^-3), (0.18x10^-3) under N, (0.06x10^-6), (0.02x10^-6), (0.06x10^-6) under V, and (0), (0), (0) under O]

Claim: We do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass

Will a word always be given the same tag? No. Consider the example:

  ^ people the city with soldiers . (i.e. 'populate')
  ^ quickly people the city .

In the first sentence "people" is most likely to be tagged as noun, whereas in the second the probability mass will shift and "people" will be tagged as verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon

• Long tail Phenomenon: Probability is very low but not zero over a large observed sequence.
• Language Phenomenon: "people", which is predominantly tagged as "Noun", displays a long tail phenomenon; "laugh" is predominantly tagged as "Verb".

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(6x10^-5) (6x10^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B) = P(B|A), if P(A) = P(B)

P(A|B) can mean:
• Causality: B causes A
• Sequentiality: A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here:
  S = {U1, U2, U3}, V = {R, G, B}
  For observation sequence O = o1 ... on and state sequence Q = q1 ... qn
  π_i = P(q1 = Ui), the initial state probabilities

A = transition matrix:
        U1    U2    U3
  U1    0.1   0.4   0.5
  U2    0.6   0.2   0.2
  U3    0.3   0.4   0.3

B = emission matrix:
        R     G     B
  U1    0.3   0.5   0.2
  U2    0.1   0.4   0.5
  U3    0.6   0.1   0.3

92
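These parameters can be written down directly, e.g. as numpy arrays (a sketch; indices 0, 1, 2 stand for U1, U2, U3 and for R, G, B; the uniform initial distribution is an assumption, since π is not given numerically on the slide):

import numpy as np

pi = np.array([1/3, 1/3, 1/3])            # assumed uniform initial distribution
A = np.array([[0.1, 0.4, 0.5],
              [0.6, 0.2, 0.2],
              [0.3, 0.4, 0.3]])            # A[i, j] = P(next urn = j | current urn = i)
B = np.array([[0.3, 0.5, 0.2],
              [0.1, 0.4, 0.5],
              [0.6, 0.1, 0.3]])            # B[i, k] = P(colour k | urn i), colours ordered R, G, B
obs = [0, 0, 1, 1, 2, 0, 1, 0]             # R R G G B R G R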

Observations and states

         O1 O2 O3 O4 O5 O6 O7 O8
OBS:     R  R  G  G  B  R  G  R
State:   S1 S2 S3 S4 S5 S6 S7 S8

Si ∈ {U1, U2, U3}: a particular state
S: state sequence; O: observation sequence
S* = "best" possible state (urn) sequence
Goal: Maximize P(S|O) by choosing the "best" S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

  S* = argmax_S P(S|O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

P(S|O) = P(S1-8 | O1-8)
       = P(S1|O) P(S2|S1, O) P(S3|S2, S1, O) ... P(S8|S7 ... S1, O)

By Markov Assumption (a state depends only on the previous state):

P(S|O) = P(S1|O) P(S2|S1, O) P(S3|S2, O) ... P(S8|S7, O)

         O1 O2 O3 O4 O5 O6 O7 O8
OBS:     R  R  G  G  B  R  G  R
State:   S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) P(B) = P(A) P(B|A)

P(A): Prior
P(B|A): Likelihood

argmax_S P(S|O) = argmax_S P(S) P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

P(S) = P(S1-8)
     = P(S1) P(S2|S1) P(S3|S2, S1) P(S4|S3, S2, S1) ... P(S8|S7 ... S1)

By Markov Assumption (k=1):

P(S) = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) ... P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

P(O|S) = P(O1 | S1-8) P(O2 | O1, S1-8) P(O3 | O2, O1, S1-8) ... P(O8 | O1-7, S1-8)

Assumption: the ball drawn depends only on the Urn chosen

P(O|S) = P(O1|S1) P(O2|S2) P(O3|S3) ... P(O8|S8)

P(S|O) ∝ P(S) P(O|S)
       = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) ... P(S8|S7)
         x P(O1|S1) P(O2|S2) P(O3|S3) ... P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

[Chain over the trellis: S0 --ε--> S1 --R--> S2 --R--> S3 --G--> S4 --G--> S5 --B--> S6 --R--> S7 --G--> S8 --R--> S9]

         O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:     ε  R  R  G  G  B  R  G  R
State:   S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

Notation:  P(Ok|Sk) P(Sk+1|Sk) = P(Sk → Sk+1) on Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

[Probabilistic FSM with two states S1, S2; arcs are labelled (symbol, probability): (a1, 0.3), (a2, 0.4), (a1, 0.2), (a2, 0.3), (a1, 0.1), (a2, 0.2), (a1, 0.3), (a2, 0.2)]

The question here is: "What is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree

Start (ε): S1 = 1.0, S2 = 0.0

Reading a1 (arc probabilities 0.1, 0.3, 0.2, 0.3):
  from S1: to S1 = 1.0 x 0.1 = 0.1,  to S2 = 1.0 x 0.3 = 0.3
  from S2: to S1 = 0.0,              to S2 = 0.0

Reading a2 (arc probabilities 0.2, 0.4, 0.3, 0.2):
  from S1: to S1 = 0.1 x 0.2 = 0.02,  to S2 = 0.1 x 0.4 = 0.04
  from S2: to S1 = 0.3 x 0.3 = 0.09,  to S2 = 0.3 x 0.2 = 0.06

Choose the winning sequence per state, per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd...

[Continuing the tree from the per-state winners 0.09 (ending in S1) and 0.06 (ending in S2): reading a1 with arc probabilities 0.1, 0.3, 0.2, 0.3 gives 0.09x0.1 = 0.009, 0.027, 0.012 and 0.018; reading a2 with arc probabilities 0.3, 0.2, 0.4, 0.2 gives the leaf scores 0.0081, 0.0054, 0.0048 and 0.0024]

The problem being addressed by this tree is  S* = argmax_S P(S | a1-a2-a1-a2, μ)

where a1-a2-a1-a2 is the output sequence and μ the model, or the machine.

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 → S2 → S1 → S2 → S1   (reading a1, a2, a1, a2)

Problem statement: Find the best possible sequence
  S* = argmax_S P(S | O, μ)
where S = state sequence, O = output sequence, μ = machine or model

Machine or Model μ = (S, A, S0, T)
  S0: start symbol;  S: state collection;  A: alphabet set;  T: transitions

T is defined as  P(Si --ak--> Sj)

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

Latest symbol observed →
              ε      a1                               a2              a1                 a2
Ending state:
  S1          1.0    (1.0x0.1, 0.0x0.2) = (0.1, 0.0)   (0.02, 0.09)    (0.009, 0.012)     (0.0024, 0.0081)
  S2          0.0    (1.0x0.3, 0.0x0.3) = (0.3, 0.0)   (0.04, 0.06)    (0.027, 0.018)     (0.0048, 0.0054)

Note: Every cell records the winning probability ending in that state.

Final winner: the bold-faced value in each cell is the winning sequence probability ending in that state. Going backward from the final winner sequence, which ends in state S2 (indicated by the 2nd tuple), we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm
(following James Allen, Natural Language Understanding (2nd edition), Benjamin/Cummings (pub.), 1995)

Given:
1. The HMM, which means:
   a. Start State: S1
   b. Alphabet: A = {a1, a2, ..., ap}
   c. Set of States: S = {S1, S2, ..., Sn}
   d. Transition probability P(Si --ak--> Sj), which is equal to P(Sj, ak | Si)
2. The output string a1 a2 ... aT

To find: The most likely sequence of states C1 C2 ... CT which produces the given output sequence, i.e.,

  C1 C2 ... CT = argmax_C [ P(C | a1, a2, ..., aT) ]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contd...

Data Structures:
1. An N x T array called SEQSCORE to maintain the winner sequence always (N = #states, T = length of output sequence)
2. Another N x T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For i = 2 to N do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For t = 2 to T do
    For i = 1 to N do
        SEQSCORE(i,t) = Max(j=1..N) [ SEQSCORE(j, (t-1)) * P(Sj --a_k--> Si) ]
        BACKPTR(i,t) = index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = i that maximizes SEQSCORE(i,T)
For i from (T-1) to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1 x T
2. SEQSCORE can be T x 2

Homework: Compare this with A*/Beam Search. Reason for this comparison: both of them work for finding and recovering a sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 109
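A direct sketch of these three steps (1-based indexing replaced by Python's 0-based indexing; the matrices are assumed to be in the numpy form shown for the urn example, with P(Sj --a--> Si) factored as transition times emission):

import numpy as np

def viterbi(obs, A, B, start_state=0):
    # A[j, i] = P(state i | state j); B[i, k] = P(symbol k | state i)
    N, T = A.shape[0], len(obs)
    seqscore = np.zeros((N, T))
    backptr = np.zeros((N, T), dtype=int)

    # 1. Initialization: all probability mass on the start state (obs[0] plays the role of the ε symbol)
    seqscore[start_state, 0] = 1.0

    # 2. Iteration: SEQSCORE(i, t) = max_j SEQSCORE(j, t-1) * P(Sj --obs[t]--> Si)
    for t in range(1, T):
        for i in range(N):
            cand = seqscore[:, t - 1] * A[:, i] * B[i, obs[t]]
            backptr[i, t] = int(np.argmax(cand))
            seqscore[i, t] = cand[backptr[i, t]]

    # 3. Sequence identification: follow the back-pointers from the best final state
    path = [int(np.argmax(seqscore[:, T - 1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[path[-1], t]))
    return list(reversed(path))

To run it on the urn example one would either prepend a start state to A (mirroring the slides' S0 with its ε-transition) or fold the initial distribution π into the first column directly; the returned list is the most likely state sequence, the analogue of C1 C2 ... CT above.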

Viterbi Algorithm for the Urn problem (first two symbols)

[Tree for the first two symbols ε, R: S0 branches to U1, U2, U3 on ε (scores 0.5, 0.3, 0.2 as drawn); on R, each Ui branches again to U1, U2, U3, multiplying by transition x emission, giving intermediate scores such as 0.03, 0.08, 0.15, 0.06, 0.02, 0.18, 0.24 and leaf scores 0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036, of which only the per-state winners are retained]

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Example of Application of Noisy Channel Model Probabilistic Speech Recognition (Isolated Word)[8]

Problem Definition Given a sequence of speech signals identify the words

2 steps Segmentation (Word Boundary Detection) Identify the word

Isolated Word Recognition Identify W given SS (speech signal)

^arg max ( | )

WW P W SS

21 July 2014Pushpak Bhattacharyya Intro

POS 38

Identifying the word

P(SS|W) = likelihood called ldquophonological model ldquo intuitively more tractable

P(W) = prior probability called ldquolanguage modelrdquo

^arg m ax ( | )

arg m ax ( ) ( | )W

W

W P W SS

P W P SS W

W appears in the corpus( ) words in the corpus

P W

21 July 2014Pushpak Bhattacharyya Intro

POS 39

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound Text ambiguity whether vs weather right vs write bought vs bot

Text Sound ambiguity read (present tense) vs read (past tense) lead (verb) vs lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W)= P(t o m ae t o |Word is ldquotomatordquo) = Product of arc probabilities

t

s4

o m o

ae

t

aa

end

s1 s2 s3s5

s6 s7

10 10 10 1010

10

073

027

Word

Pronunciation Automaton

Tomato

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs Discrimnative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

bull Entity A and Entity BndashMatch(AB)

ndashTwo entities match iff their parts matchbull Match(Parts(A) Parts(B))

ndashTwo entities match iff their properties matchbull Match(Properties(A) Properties(B))

bull Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding James Allan Speech and NLP Jurafsky and Martin Foundations of Statistical NLP Manning and Schutze

Other References Statistical NLP Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied DisciplinesPhilosophy Semantics Meaning of ldquomeaningrdquo Logic

(syllogism)Linguistics Study of Syntax Lexicon Lexical Semantics etc

Probability and Statistics Corpus Linguistics Testing of Hypotheses System Evaluation

Cognitive Science Computational Models of Language Processing Language Acquisition

Psychology Behavioristic insights into Language Processing Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory Entropy Random Fields

Computer Sc amp Engg Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

infection declension derivation etc ) Rule based Vs Statistical NLU comparison with POS tagging as case study Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

ldquo_ldquo The_DT mechanisms_NNS that_WDTmake_VBP traditional_JJ hardware_NNare_VBP really_RB being_VBGobsoleted_VBN by_IN microprocessor-based_JJ machines_NNS _ rdquo_rdquo said_VBDMr_NNP Benton_NNP _

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd.)
B(k, p) = P(ok, ok+1, ok+2, …, om | Sp)
= Σ_{q=0..N} P(ok, ok+1, ok+2, …, om, Sq | Sp)
= Σ_{q=0..N} P(ok, Sq | Sp) · P(ok+1, ok+2, …, om | ok, Sq, Sp)
= Σ_{q=0..N} P(ok+1, ok+2, …, om | Sq) · P(ok, Sq | Sp)
= Σ_{q=0..N} B(k+1, q) · P(Sp → Sq) carrying output ok

O0 O1 O2 O3 … Ok Ok+1 … Om-1 Om
S0 S1 S2 S3 … Sp Sq  … Sm  Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121
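A matching sketch of the backward computation, again an illustration rather than the lecture's code; the tables are the same assumed urn-HMM arrays as in the forward sketch above. Since forward and backward both yield P(O), this also hints at Exercise 2 further on.

import numpy as np

A = np.array([[0.1, 0.4, 0.5], [0.6, 0.2, 0.2], [0.3, 0.4, 0.3]])   # same transition table
B = np.array([[0.3, 0.5, 0.2], [0.1, 0.4, 0.5], [0.6, 0.1, 0.3]])   # same emission table
pi = np.full(3, 1/3)                                                 # assumed uniform start

def backward(obs):
    beta = np.ones(len(pi))                  # beta at the last position: all ones
    for o in reversed(obs[1:]):              # walk o_m ... o_1 backwards
        beta = A @ (B[:, o] * beta)          # beta_k(p) = sum_q A[p,q] * B[q, o_{k+1}] * beta_{k+1}(q)
    return (pi * B[:, obs[0]]) @ beta        # combine with the first emission: P(O)

obs = [0, 0, 1, 1, 2, 0, 1, 0]               # R R G G B R G R
print(backward(obs))                         # equals forward(obs) from the sketch above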

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probabilities:
        ^     N     V     .
^       0    0.7   0.3    0
N       0    0.2   0.6   0.2
V       0    0.6   0.2   0.2
.       1     0     0     0

Lexical probabilities:
        ε   People   Laugh
^       1      0       0
N       0     0.8     0.2
V       0     0.1     0.9
.       1      0       0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all state sequences of Π_j P(Oj | Si) · P(Si → Sj), i.e. the sum over all paths of the product of arc probabilities.

Computation in the various paths of the tree (observations ε, People, Laugh):

Path 1: ^ N N
P(Path1) = (1.0 × 0.7) × (0.8 × 0.2) × (0.2 × 0.2)

Path 2: ^ N V
P(Path2) = (1.0 × 0.7) × (0.8 × 0.6) × (0.9 × 0.2)

Path 3: ^ V N
P(Path3) = (1.0 × 0.3) × (0.1 × 0.6) × (0.2 × 0.2)

Path 4: ^ V V
P(Path4) = (1.0 × 0.3) × (0.1 × 0.2) × (0.9 × 0.2)

(Tree: from ^ with an ε-arc, branches N and V for 'People', each branching again into N and V for 'Laugh'.)

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis
F = accumulated F × output probability × transition probability

F(N, People) = 0.7 × 1.0
F(V, People) = 0.3 × 1.0
F(N, Laugh) = F(N, People) × (0.2 × 0.8) + F(V, People) × (0.6 × 0.1)
F(V, Laugh) = F(N, People) × (0.6 × 0.8) + F(V, People) × (0.2 × 0.1)
F(.) = F(N, Laugh) × (0.2 × 0.2) + F(V, Laugh) × (0.2 × 0.9)

(Trellis: ^ with ε, then N and V columns for 'People' and 'Laugh', then the end state.)

21 July 2014Pushpak Bhattacharyya Intro

POS 125
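A quick numeric check of the trellis values above, using the transition and lexical tables of the "^ N V ." example as reconstructed on the earlier slide; the dictionaries below are an illustration of that reconstruction, not code from the lecture.

P_trans = {('^','N'): 0.7, ('^','V'): 0.3,
           ('N','N'): 0.2, ('N','V'): 0.6, ('N','.'): 0.2,
           ('V','N'): 0.6, ('V','V'): 0.2, ('V','.'): 0.2}
P_lex = {('^','eps'): 1.0,
         ('N','People'): 0.8, ('N','Laugh'): 0.2,
         ('V','People'): 0.1, ('V','Laugh'): 0.9}

F_N1 = P_trans[('^','N')] * P_lex[('^','eps')] * 1.0     # 0.7
F_V1 = P_trans[('^','V')] * P_lex[('^','eps')] * 1.0     # 0.3
F_N2 = F_N1 * P_trans[('N','N')] * P_lex[('N','People')] + F_V1 * P_trans[('V','N')] * P_lex[('V','People')]
F_V2 = F_N1 * P_trans[('N','V')] * P_lex[('N','People')] + F_V1 * P_trans[('V','V')] * P_lex[('V','People')]
F_end = F_N2 * P_trans[('N','.')] * P_lex[('N','Laugh')] + F_V2 * P_trans[('V','.')] * P_lex[('V','Laugh')]
print(F_end)   # P(O): equals the sum over the four tree paths of the previous slide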

Number of Multiplications
Tree: each path has 5 multiplications; there are 4 paths in the tree; therefore a total of 20 multiplications and 3 additions.
Trellis: F(N, People) → 1 multiplication; F(V, People) → 1 multiplication; F(N, Laugh) = F(N, People) × (1 mult) + F(V, People) × (1 mult) = 4 multiplications + 1 addition; similarly for F(V, Laugh) and F(.): 4 multiplications and 1 addition each.
So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity
Let |S| = #states and |O| = observation length, excluding '^' and '.'.
Stage 1 of the trellis: |S| multiplications.
Stage 2 of the trellis: |S| nodes, each needing computation over |S| arcs; each arc = 1 multiplication, and the accumulated F = 1 more multiplication; total 2|S|² multiplications.
The same holds for each stage before reading '.'. At the final stage ('.'): 2|S| multiplications.
Therefore total multiplications = |S| + 2|S|²(|O| − 1) + 2|S|.

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values multiplied by P(Sp → Sq).
3. Complexity = |S| + 2|S|²(|O| − 1) + 2|S| = 2|S|²|O| − 2|S|² + 3|S| = O(|S|² |O|),

i.e. linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Identifying the word

P(SS|W) = likelihood, called the "phonological model"; intuitively more tractable

P(W) = prior probability, called the "language model"

W* = argmax_W P(W | SS) = argmax_W P(W) · P(SS | W)

P(W) = #(W appears in the corpus) / #(words in the corpus)

21 July 2014Pushpak Bhattacharyya Intro

POS 39
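A toy illustration of the noisy-channel decision above; all words and probabilities here are made up for the sketch and are not from any real corpus or phonological model.

# W* = argmax_W P(W) * P(SS | W): toy noisy-channel decoder
P_W = {'right': 0.002, 'write': 0.001, 'rite': 0.0001}        # language model P(W), made up
P_SS_given_W = {'right': 0.95, 'write': 0.95, 'rite': 0.95}   # phonological model P(SS|W): homophones
best = max(P_W, key=lambda w: P_W[w] * P_SS_given_W[w])
print(best)   # 'right' wins on the prior when the likelihoods are equal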

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound → Text ambiguity: whether vs. weather, right vs. write, bought vs. bot

Text → Sound ambiguity: read (present tense) vs. read (past tense), lead (verb) vs. lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W) = P(t o m ae t o | Word is "tomato") = product of arc probabilities

(Pronunciation automaton for "Tomato": states s1 … s7 leading to an end state; after 't o m' the automaton branches to 'ae' with probability 0.73 or 'aa' with probability 0.27, then continues through 't o' to the end; all other arcs have probability 1.0.)

21 July 2014Pushpak Bhattacharyya Intro

POS 44
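The "product of arc probabilities" is just a multiplication along each path of the automaton; the two lines below use the arc values read off the reconstructed figure above (an assumption about the figure, not values stated in running text).

p_ae_path = 1.0 * 1.0 * 1.0 * 0.73 * 1.0 * 1.0   # t o m ae t o
p_aa_path = 1.0 * 1.0 * 1.0 * 0.27 * 1.0 * 1.0   # t o m aa t o
print(p_ae_path, p_aa_path)                       # 0.73 and 0.27: the two pronunciations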

Foundational question

Generative vs. Discriminative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

bull Entity A and Entity BndashMatch(AB)

ndashTwo entities match iff their parts matchbull Match(Parts(A) Parts(B))

ndashTwo entities match iff their properties matchbull Match(Properties(A) Properties(B))

bull Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding, James Allen; Speech and NLP, Jurafsky and Martin; Foundations of Statistical NLP, Manning and Schutze

Other References Statistical NLP Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied Disciplines
Philosophy: Semantics, meaning of "meaning", logic (syllogism)
Linguistics: Study of syntax, lexicon, lexical semantics, etc.
Probability and Statistics: Corpus linguistics, testing of hypotheses, system evaluation
Cognitive Science: Computational models of language processing, language acquisition
Psychology: Behavioristic insights into language processing, psychological models
Brain Science: Language processing areas in the brain
Physics: Information theory, entropy, random fields
Computer Sc. & Engg.: Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2: Shallow parsing. Before break: Morph analysis and synthesis (segmentation,
inflection, declension, derivation, etc.); Rule based vs. Statistical NLU comparison with POS tagging as case study; Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summary
• Both Linguistics and Computation needed
• Linguistics is the eye, Computation the body
• Phenomenon → Formalization → Technique → Experimentation → Evaluation → Hypothesis Testing
• has accorded to NLP the prestige it commands today
• Natural Science like approach
• Neither Theory Building nor Data Driven Pattern finding can be ignored
21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

"_" The_DT mechanisms_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, "_" said_VBD Mr._NNP Benton_NNP ._.

21 July 2014Pushpak Bhattacharyya Intro

POS 57
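A sentence like the one above can be tagged with Penn Treebank tags using NLTK's off-the-shelf tagger (the course uses NLTK on Day-1). The snippet is a sketch: the models must be downloaded first, and the tagger's output may differ slightly from the hand-tagged example on the slide.

# Requires: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
import nltk

sentence = "The mechanisms that make traditional hardware are really being obsoleted by microprocessor-based machines."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))   # e.g. [('The', 'DT'), ('mechanisms', 'NNS'), ('that', 'WDT'), ...]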

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_J batsman_N Aravinda_De_Silva_N is_F a_F man_N of_F few_J words_N was_F very_R much_R evident_J on_F Wednesday_N when_F the_F legendary_J batsman_N ,_F who_F has_V always_R let_V his_N bat_N talk_V ,_F struggled_V to_F answer_V a_F barrage_N of_F questions_N at_F a_F function_N to_F promote_V the_F cricket_N league_N in_F the_F city_N ._F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation: That_F/N/J ('that' can be a complementizer (can be put under 'F'), demonstrative (can be put under 'J') or pronoun (can be put under 'N')); former_J; Sri_N/J Lanka_N/J (Sri Lanka together qualify the skipper); skipper_N/V ('skipper' can be a verb too); and_F; ace_J/N ('ace' can be both J and N: "Nadal served an ace"); batsman_N/J ('batsman' can be J as it qualifies Aravinda De Silva); Aravinda_N De_N Silva_N; is_F; a_F; man_N/V ('man' can be a verb too, as in 'man the boat'); of_F; few_J; words_N/V ('words' can be a verb too, as in 'he words his speeches beautifully')

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of "That"

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation: was_F; very_R; much_R; evident_J; on_F; Wednesday_N; when_F/N ('when' can be a relative pronoun (put under 'N') as in 'I know the time when he comes'); the_F; legendary_J; batsman_N; who_F/N; has_V; always_R; let_V; his_N; bat_N/V; talk_V/N; struggle_V/N; answer_V/N; barrage_N/V; question_N/V; function_N/V; promote_V; cricket_N; league_N; city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (1/2)
Best tag sequence = T* = argmax P(T|W) = argmax P(T)P(W|T)   (by Bayes' Theorem)

P(T) = P(t0=^, t1, t2, …, tn+1=.)
= P(t0) P(t1|t0) P(t2|t1,t0) P(t3|t2,t1,t0) … P(tn|tn-1,tn-2,…,t0) P(tn+1|tn,tn-1,…,t0)
= P(t0) P(t1|t0) P(t2|t1) … P(tn|tn-1) P(tn+1|tn)
= Π_{i=0}^{n+1} P(ti|ti-1)    (Bigram Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (2/2)

P(W|T) = P(w0|t0-tn+1) P(w1|w0,t0-tn+1) P(w2|w1,w0,t0-tn+1) … P(wn|w0-wn-1,t0-tn+1) P(wn+1|w0-wn,t0-tn+1)

Assumption: a word is determined completely by its tag. This is inspired by speech recognition.

= P(w0|t0) P(w1|t1) … P(wn+1|tn+1)
= Π_{i=0}^{n+1} P(wi|ti)
= Π_{i=1}^{n+1} P(wi|ti)    (Lexical Probability Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R ._.

(Lattice: each word has candidate tags, e.g. N/V for 'People', V/N for 'Jump', A/N for 'High'; arcs between tags carry bigram probabilities, and tag-to-word arcs carry lexical probabilities.)

This model is called a generative model. Here words are observed from tags as states. This is similar to an HMM.

21 July 2014Pushpak Bhattacharyya Intro

POS 67
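Under the bigram and lexical probability assumptions, the score of a tagged sentence such as the one above is just a product of tag-bigram and word-given-tag probabilities. The sketch below illustrates this; all the probability values in the two dictionaries are assumed toy numbers, not estimates from a corpus.

# P(T) * P(W|T) = prod_i P(t_i | t_{i-1}) * prod_i P(w_i | t_i)
P_tag_bigram = {('^','N'): 0.6, ('N','V'): 0.4, ('V','R'): 0.3, ('R','.'): 0.4}
P_word_given_tag = {('People','N'): 1e-3, ('jump','V'): 2e-3, ('high','R'): 5e-3, ('.','.'): 1.0}

def score(words, tags):
    p, prev = 1.0, '^'
    for w, t in zip(words, tags):
        p *= P_tag_bigram[(prev, t)] * P_word_given_tag[(w, t)]
        prev = t
    return p

print(score(['People', 'jump', 'high', '.'], ['N', 'V', 'R', '.']))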

Typical POS tag steps

Implementation of Viterbi - Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68
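The evaluation steps listed above (per-POS accuracy and the confusion matrix) reduce to counting agreements between gold and predicted tags. The sketch below uses made-up gold/predicted sequences purely for illustration.

from collections import Counter, defaultdict

gold = ['N', 'V', 'N', 'R', 'N', 'V']     # toy gold tags
pred = ['N', 'V', 'N', 'N', 'N', 'N']     # toy predicted tags

confusion = defaultdict(Counter)          # confusion[gold_tag][pred_tag] = count
for g, p in zip(gold, pred):
    confusion[g][p] += 1

for tag, row in confusion.items():
    correct, total = row[tag], sum(row.values())
    print(tag, 'accuracy =', correct / total, 'row =', dict(row))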

Per POS Accuracy for Bigram Assumption
(Figure: bar chart of per-tag accuracy, one bar per tag from AJ0 through VVZ-NN2, with accuracy on the y-axis from 0 to 1.2.)

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

            AJ0  AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN  AJC  AJS   AT0   AV0  AV0-AJ0  AVP
AJ0        2899       20       32        1        3        3    0    0    18    35       27    1
AJ0-AV0      31       18        2        0        0        0    0    0     0     1       15    0
AJ0-NN1     161        0      116        0        0        0    0    0     0     0        1    0
AJ0-VVD       7        0        0        0        0        0    0    0     0     0        0    0
AJ0-VVG       8        0        0        0        2        0    0    0     1     0        0    0
AJ0-VVN       8        0        0        3        0        2    0    0     1     0        0    0
AJC           2        0        0        0        0        0   69    0     0    11        0    0
AJS           6        0        0        0        0        0    0   38     0     2        0    0
AT0         192        0        0        0        0        0    0    0  7000    13        0    0
AV0         120        8        2        0        0        0   15    2    24  2444       29   11
AV0-AJ0      10        7        0        0        0        0    0    0     0    16       33    0
AVP          24        0        0        0        0        0    0    0     1    11        0  737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1: # of Red = 30, # of Green = 50, # of Blue = 20
Urn 2: # of Red = 10, # of Green = 40, # of Blue = 50
Urn 3: # of Red = 60, # of Green = 10, # of Blue = 30

Colored ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

Transition probability table:
       U1    U2    U3
U1    0.1   0.4   0.5
U2    0.6   0.2   0.2
U3    0.3   0.4   0.3

Emission probability table:
        R     G     B
U1    0.3   0.5   0.2
U2    0.1   0.4   0.5
U3    0.6   0.1   0.3

Given the observation sequence R R G G B R G R, the corresponding state sequence is not so easily computable.

21 July 2014Pushpak Bhattacharyya Intro

POS 73


Diagrammatic representation (1/2)
(State diagram: U1, U2, U3 with the transition probabilities above on the arcs, e.g. 0.1, 0.4, 0.5 leaving U1; each urn also shows its emission probabilities, e.g. U1: R 0.3, G 0.5, B 0.2 and U3: R 0.6, G 0.1, B 0.3.)

21 July 2014Pushpak Bhattacharyya Intro

POS 74
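The diagram describes a generative process: pick an urn, draw a coloured ball, move to the next urn, and repeat. The sketch below simulates this process with the tables above; the uniform choice of the starting urn is an assumption, since the slides do not specify an initial distribution.

import numpy as np

A = np.array([[0.1, 0.4, 0.5], [0.6, 0.2, 0.2], [0.3, 0.4, 0.3]])    # urn-to-urn transitions
B = np.array([[0.3, 0.5, 0.2], [0.1, 0.4, 0.5], [0.6, 0.1, 0.3]])    # colour probabilities per urn
colours = ['R', 'G', 'B']
rng = np.random.default_rng(0)

urn = rng.integers(3)                        # assumed uniform start
observations, states = [], []
for _ in range(8):
    states.append('U%d' % (urn + 1))
    observations.append(colours[rng.choice(3, p=B[urn])])
    urn = rng.choice(3, p=A[urn])
print(' '.join(observations))                # e.g. a sequence like R R G G B R G R
print(' '.join(states))                      # the hidden urns that produced it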

Diagrammatic representation (2/2)
(Same diagram with each arc labelled by the product of emission and transition probabilities, e.g. one arc carries R 0.06, G 0.24, B 0.30 and another R 0.02, G 0.08, B 0.10.)

POS 75

Classic problems with respect to HMM:
1. Given the observation sequence, find the possible state sequences: Viterbi algorithm
2. Given the observation sequence, find its probability: forward/backward algorithm
3. Given the observation sequence, find the HMM parameters: Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi
The "start" and "end" are important in a sequence. Subtrees get eliminated due to the Markov Assumption.
POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), . (end) [start & end states]

Illustration of Viterbi: Lexicon
people: N, V
laugh: N, V

Corpora for Training:
^ w11_t11 w12_t12 w13_t13 …… w1k_1_t1k_1
^ w21_t21 w22_t22 w23_t23 …… w2k_2_t2k_2
^ wn1_tn1 wn2_tn2 wn3_tn3 …… wnk_n_tnk_n

Inference

(Partial sequence graph: ^ branching into the candidate partial tag paths N-N and N-V.)

        ^     N     V     O     .
^       0    0.6   0.2   0.2    0
N       0    0.1   0.4   0.3   0.2
V       0    0.3   0.1   0.3   0.3
O       0    0.3   0.2   0.3   0.2
.       1     0     0     0     0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = #pos tags in tagset × vocabulary size
(vocabulary size = #unique words in corpus)

        ε       people      laugh      …
^       1         0           0        0
N       0      1×10^-3     1×10^-5
V       0      1×10^-6     1×10^-3
O       0         0        1×10^-9
.       1         0           0        0

Inference
New sentence: ^ people laugh

p(^ N N | ^ people laugh) = (0.6 × 0.1) × (0.1 × 1×10^-3) × (0.2 × 1×10^-5)

(Partial sequence graph for the new sentence: ε-arcs from ^ and the partial paths N-N and N-V.)

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming
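For the toy tables above, the brute force can still be written down, and it makes the exponential blow-up concrete; the sketch below is an illustration using the transition and lexical values as reconstructed above, which Viterbi's dynamic programming avoids recomputing.

from itertools import product

tags = ['N', 'V', 'O']
P_trans = {('^','N'): 0.6, ('^','V'): 0.2, ('^','O'): 0.2,
           ('N','N'): 0.1, ('N','V'): 0.4, ('N','O'): 0.3, ('N','.'): 0.2,
           ('V','N'): 0.3, ('V','V'): 0.1, ('V','O'): 0.3, ('V','.'): 0.3,
           ('O','N'): 0.3, ('O','V'): 0.2, ('O','O'): 0.3, ('O','.'): 0.2}
P_lex = {('people','N'): 1e-3, ('people','V'): 1e-6, ('people','O'): 0.0,
         ('laugh','N'): 1e-5, ('laugh','V'): 1e-3, ('laugh','O'): 1e-9}

words = ['people', 'laugh']
best, best_seq = 0.0, None
for seq in product(tags, repeat=len(words)):          # |tags|^len(words) sequences
    p, prev = 1.0, '^'
    for w, t in zip(words, seq):
        p *= P_trans[(prev, t)] * P_lex[(w, t)]
        prev = t
    p *= P_trans[(prev, '.')]                          # close with the end marker
    if p > best:
        best, best_seq = p, seq
print(best_seq, best)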

Dynamic Programming
(Tree: ^ reads ε and branches to N1, V1, O1; each of these branches to N, V, O, . for 'people', and so on for 'laugh'.)

^ → N1: 0.6 × 1.0 = 0.6;  ^ → V1: 0.2;  ^ → O1: 0.2
Expanding the N node reached from ^ on reading 'people' (P(people|N) = 1×10^-3):
  to N: 0.6 × 0.1 × 10^-3 = 6 × 10^-5
  to V: 0.6 × 0.4 × 10^-3 = 2.4 × 10^-4
  to O: 0.6 × 0.3 × 10^-3 = 1.8 × 10^-4
  to .: 0.6 × 0.2 × 10^-3 = 1.2 × 10^-4

No need to expand N4 and N5 because they will never be a part of the winning sequence.

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.

Now complexity reduces from |s|^|o| to |s|·|o|.

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state. Keep advancing sequences that are
"maximum" amongst all those ending in the same state.

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm
Tree for the sentence "^ People laugh .":
From ^ (reading ε) the branches N, V, O get scores (0.6), (0.2), (0.2).
After reading 'People', the children of N score (0.06×10^-3), (0.24×10^-3), (0.18×10^-3); the children of V score (0.06×10^-6), (0.02×10^-6), (0.06×10^-6); the children of O score (0), (0), (0).

Claim: We do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (i.e. 'populate')

^ quickly people the city
In the first sentence "people" is most likely to be tagged as noun, whereas in the second, probability mass will shift and "people" will be tagged as verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and language phenomenon
Long tail phenomenon: probability is very low but not zero over a large observed sequence.
Language phenomenon: "people", which is predominantly tagged as "Noun", displays a long tail phenomenon; "laugh" is predominantly tagged as "Verb".

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(6×10^-5)   (6×10^-8)

LAUGH

Next step: all the probabilities will be multiplied by identical probabilities (lexical and transition). So the children of N2 will have probability less than the children of N1.

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B) = P(B|A) if P(A) = P(B)

P(A|B) can mean: Causality: B causes A; Sequentiality: A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = {U1, U2, U3}, V = {R, G, B}

For observation O = o1 … on

and state sequence Q = q1 … qn,

π is defined by πi = P(q1 = Ui)

A (transition) =
       U1    U2    U3
U1    0.1   0.4   0.5
U2    0.6   0.2   0.2
U3    0.3   0.4   0.3

B (emission) =
        R     G     B
U1    0.3   0.5   0.2
U2    0.1   0.4   0.5
U3    0.6   0.1   0.3

92

Observations and states
       O1 O2 O3 O4 O5 O6 O7 O8
OBS:    R  R  G  G  B  R  G  R
State: S1 S2 S3 S4 S5 S6 S7 S8

Si = U1, U2 or U3 (a particular state); S: state sequence; O: observation sequence; S* = "best" possible state (urn) sequence.
Goal: maximize P(S|O) by choosing the "best" S.

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O), where S is the state sequence and O is the observation sequence:

S* = argmax_S P(S|O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

P(S|O) = P(S1-8 | O1-8)
= P(S1|O) P(S2|S1, O) P(S3|S2,S1, O) … P(S8|S7…S1, O)

By Markov Assumption (a state depends only on the previous state):
P(S|O) = P(S1|O) P(S2|S1, O) P(S3|S2, O) … P(S8|S7, O)

       O1 O2 O3 O4 O5 O6 O7 O8
OBS:    R  R  G  G  B  R  G  R
State: S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) = P(A) P(B|A) / P(B)

P(A): Prior
P(B|A): Likelihood

argmax_S P(S|O) = argmax_S P(S) P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

P(S) = P(S1-8) = P(S1) P(S2|S1) P(S3|S2,S1) P(S4|S3,S2,S1) … P(S8|S7…S1)

By Markov Assumption (k=1):
P(S) = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) … P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

P(O|S) = P(O1-8 | S1-8) = P(O1|O2-8, S1-8) P(O2|O3-8, S1-8) … P(O8|S1-8)

Assumption: the ball drawn depends only on the urn chosen:
P(O|S) = P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)

P(S|O) ∝ P(S) P(O|S)
= P(S1) P(S2|S1) P(S3|S2) P(S4|S3) … P(S8|S7) · P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e. P(S9|S8) = 1.
O0 is an ε-transition.

       O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:    ε  R  R  G  G  B  R  G  R
State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

(Chain of states S0 → S1 → … → S8 → S9, with the observations ε, R, R, G, G, B, R, G, R emitted on the arcs.)

       O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:    ε  R  R  G  G  B  R  G  R
State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

Notation: P(Ok|Sk) · P(Sk+1|Sk) = P(Sk → Sk+1) carrying output Ok.

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM
(Two states S1 and S2; arcs labelled with (symbol, probability) pairs: (a1, 0.3), (a2, 0.4), (a1, 0.2), (a2, 0.3), (a1, 0.1), (a2, 0.2), (a1, 0.3), (a2, 0.2).)

The question here is: "What is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree
Start (ε): S1 = 1.0, S2 = 0.0
Reading a1 (arc probabilities 0.1, 0.3 from S1 and 0.2, 0.3 from S2):
  1.0 × 0.1 = 0.1, 1.0 × 0.3 = 0.3, 0.0, 0.0
Reading a2 (arc probabilities 0.2, 0.4, 0.3, 0.2):
  0.1 × 0.2 = 0.02, 0.1 × 0.4 = 0.04, 0.3 × 0.3 = 0.09, 0.3 × 0.2 = 0.06

Choose the winning sequence per state per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd.
From the stage-2 winners (S1: 0.09, S2: 0.06), reading a1 (arc probabilities 0.1, 0.3, 0.2, 0.3):
  from S1: 0.09 × 0.1 = 0.009, 0.09 × 0.3 = 0.027
  from S2: 0.06 × 0.2 = 0.012, 0.06 × 0.3 = 0.018
Reading a2 from the surviving nodes:
  0.027 × 0.3 = 0.0081 (S1), 0.027 × 0.2 = 0.0054 (S2)
  0.012 × 0.4 = 0.0048 (S2), 0.012 × 0.2 = 0.0024 (S1)

The problem being addressed by this tree is S* = argmax_S P(S | a1-a2-a1-a2, μ),
where a1-a2-a1-a2 is the output sequence and μ the model or the machine.

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement: find the best possible sequence S* = argmax_S P(S | O, μ),
where S = state sequence, O = output sequence, μ = machine or model.

Machine or model: μ = (S, A, T, S0), with state collection S, alphabet set A, transitions T and start symbol/state S0.

T is defined as the set of probabilities P(Si --a_k--> Sj).

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

Ending state   ε      a1                               a2              a1                 a2
S1             1.0    (1.0×0.1, 0.0×0.2) = (0.1, 0.0)  (0.02, 0.09)    (0.009, 0.012)     (0.0024, 0.0081)
S2             0.0    (1.0×0.3, 0.0×0.3) = (0.3, 0.0)  (0.04, 0.06)    (0.027, 0.018)     (0.0048, 0.0054)

Columns: latest symbol observed; rows: ending state. Note: every cell records the winning probability ending in that state. The bold-faced values in each cell show the sequence probability ending in that state. Going backward from the final winner sequence, which ends in state S2 (indicated by the 2nd tuple), we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm (following James Allen, Natural Language Understanding (2nd edition), Benjamin/Cummings (pub.), 1995)

Given:
1. The HMM, which means:
   a. Start state S1
   b. Alphabet A = {a1, a2, …, ap}
   c. Set of states S = {S1, S2, …, Sn}
   d. Transition probability P(Si --a_k--> Sj), which is equal to P(Sj, ak | Si)
2. The output string a1 a2 … aT

To find: the most likely sequence of states C1 C2 … CT which produces the given output sequence, i.e.
C1 C2 … CT = argmax_C [ P(C | a1 a2 … aT) ]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structures:
1. An N×T array called SEQSCORE to maintain the winner sequence always (N = #states, T = length of o/p sequence)
2. Another N×T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For (i = 2 to N) do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For (t = 2 to T) do
    For (i = 1 to N) do
        SEQSCORE(i,t) = Max over j = 1..N of [ SEQSCORE(j, (t−1)) × P(Sj --a_k--> Si) ]
        BACKPTR(i,t) = the index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = i that maximizes SEQSCORE(i,T)
For i from (T−1) to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1×T
2. SEQSCORE can be T×2

Homework: compare this with A* and Beam Search. Reason for this comparison: both of them work for finding and recovering the best sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 109
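The three steps above map almost line for line onto the short sketch below. It is an illustration, not the lecture's own code: the HMM is given as numpy arrays for the urn example, emissions are grouped with the state reached at each step (one common convention; the slides attach outputs to arcs, but the winning path is the same), and the start distribution is assumed uniform.

import numpy as np

A = np.array([[0.1, 0.4, 0.5], [0.6, 0.2, 0.2], [0.3, 0.4, 0.3]])     # P(U_j | U_i)
Bmat = np.array([[0.3, 0.5, 0.2], [0.1, 0.4, 0.5], [0.6, 0.1, 0.3]])  # P(colour | U_i)

def viterbi(obs, A, Bmat, pi):
    N, T = A.shape[0], len(obs)
    SEQSCORE = np.zeros((N, T))
    BACKPTR = np.zeros((N, T), dtype=int)
    SEQSCORE[:, 0] = pi * Bmat[:, obs[0]]                  # initialization
    for t in range(1, T):                                  # iteration
        for i in range(N):
            cand = SEQSCORE[:, t - 1] * A[:, i] * Bmat[i, obs[t]]
            BACKPTR[i, t] = int(np.argmax(cand))
            SEQSCORE[i, t] = cand[BACKPTR[i, t]]
    path = [int(np.argmax(SEQSCORE[:, T - 1]))]            # sequence identification
    for t in range(T - 1, 0, -1):
        path.append(int(BACKPTR[path[-1], t]))
    return ['U%d' % (i + 1) for i in reversed(path)]

obs = [0, 0, 1, 1, 2, 0, 1, 0]      # R R G G B R G R
print(viterbi(obs, A, Bmat, np.full(3, 1/3)))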

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Ambiguities in the context of P(SS|W) or P(W|SS) Concerns

Sound Text ambiguity whether vs weather right vs write bought vs bot

Text Sound ambiguity read (present tense) vs read (past tense) lead (verb) vs lead (noun)

21 July 2014Pushpak Bhattacharyya Intro

POS 40

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W)= P(t o m ae t o |Word is ldquotomatordquo) = Product of arc probabilities

t

s4

o m o

ae

t

aa

end

s1 s2 s3s5

s6 s7

10 10 10 1010

10

073

027

Word

Pronunciation Automaton

Tomato

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs Discrimnative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

bull Entity A and Entity BndashMatch(AB)

ndashTwo entities match iff their parts matchbull Match(Parts(A) Parts(B))

ndashTwo entities match iff their properties matchbull Match(Properties(A) Properties(B))

bull Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding James Allan Speech and NLP Jurafsky and Martin Foundations of Statistical NLP Manning and Schutze

Other References Statistical NLP Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied DisciplinesPhilosophy Semantics Meaning of ldquomeaningrdquo Logic

(syllogism)Linguistics Study of Syntax Lexicon Lexical Semantics etc

Probability and Statistics Corpus Linguistics Testing of Hypotheses System Evaluation

Cognitive Science Computational Models of Language Processing Language Acquisition

Psychology Behavioristic insights into Language Processing Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory Entropy Random Fields

Computer Sc amp Engg Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

infection declension derivation etc ) Rule based Vs Statistical NLU comparison with POS tagging as case study Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

ldquo_ldquo The_DT mechanisms_NNS that_WDTmake_VBP traditional_JJ hardware_NNare_VBP really_RB being_VBGobsoleted_VBN by_IN microprocessor-based_JJ machines_NNS _ rdquo_rdquo said_VBDMr_NNP Benton_NNP _

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probabilities:
      ^    N     V     .
^     0    0.7   0.3   0
N     0    0.2   0.6   0.2
V     0    0.6   0.2   0.2
.     1    0     0     0

Lexical (emission) probabilities:
      ε    People   Laugh
^     1    0        0
N     0    0.8      0.2
V     0    0.1      0.9
.     1    0        0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all state sequences S of  Π_k P(Ok | Sk) · P(Sk+1 | Sk)

Computation in various paths of the tree (ε, People, Laugh):

Path 1: ^ N N    P(Path1) = (1.0 x 0.7) x (0.8 x 0.2) x (0.2 x 0.2)
Path 2: ^ N V    P(Path2) = (1.0 x 0.7) x (0.8 x 0.6) x (0.9 x 0.2)
Path 3: ^ V N    P(Path3) = (1.0 x 0.3) x (0.1 x 0.6) x (0.2 x 0.2)
Path 4: ^ V V    P(Path4) = (1.0 x 0.3) x (0.1 x 0.2) x (0.9 x 0.2)

[Tree: ^ branches on ε to N and V; each branches again on 'People' to N and V; the leaves read 'Laugh'.]

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis
F = accumulated F x output probability x transition probability

F_N(People) = 1.0 x 0.7
F_V(People) = 1.0 x 0.3
F_N(Laugh)  = F_N(People) x (0.8 x 0.2) + F_V(People) x (0.1 x 0.6)
F_V(Laugh)  = F_N(People) x (0.8 x 0.6) + F_V(People) x (0.1 x 0.2)
F_.         = F_N(Laugh) x (0.2 x 0.2) + F_V(Laugh) x (0.9 x 0.2)

[Trellis: ^ --ε--> {N, V} --People--> {N, V} --Laugh--> .]

21 July 2014Pushpak Bhattacharyya Intro

POS 125
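Working these out with the tables above (an editorial check, not on the slide): F_N(People) = 0.7, F_V(People) = 0.3, F_N(Laugh) = 0.7x0.16 + 0.3x0.06 = 0.13, F_V(Laugh) = 0.7x0.48 + 0.3x0.02 = 0.342, and F_. = 0.13x0.04 + 0.342x0.18 ≈ 0.0668, which equals the sum of the four path probabilities on the previous slide (0.00448 + 0.06048 + 0.00072 + 0.00108).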

Number of Multiplications

Tree: Each path has 5 multiplications; there are 4 paths in the tree.
Therefore: a total of 20 multiplications and 3 additions (to add up the 4 path scores).

Trellis: F_N(People) -> 1 multiplication; F_V(People) -> 1 multiplication.
F_N(Laugh) = F_N(People) x (1 mult) + F_V(People) x (1 mult) = 4 multiplications + 1 addition.
Similarly for F_V(Laugh) and F_.: 4 multiplications and 1 addition each.
So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity
Let |S| = #states and |O| = observation length (the words, excluding '^' and '.').
Stage 1 of the trellis: |S| multiplications.
Stage 2 of the trellis: |S| nodes; each node needs computation over |S| arcs.
  Each arc = 1 multiplication; accumulating F = 1 more multiplication. Total: 2|S|^2 multiplications.
The same holds for each stage before reading '.'.
At the final stage ('.') -> 2|S| multiplications.
Therefore, total multiplications = |S| + 2|S|^2 (|O| - 1) + 2|S|.

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values, multiplied by P(S_p -> S_q).
3. Complexity = |S| + 2|S|^2 (|O| - 1) + 2|S|
              = 2|S|^2 |O| - 2|S|^2 + 3|S|
              = O(|S|^2 |O|)
i.e. linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward Probability
   a) Derive the Backward Algorithm.
   b) Compute its complexity.
2. Express P(O) in terms of both forward and backward probability.

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Primitives

Phonemes (sound) Syllables ASCII bytes (machine representation)

21 July 2014Pushpak Bhattacharyya Intro

POS 41

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

/t/: sound of 't' in 'tag';  /d/: sound of 'd' in 'dog';  /D/: sound of 'th' in 'the'

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb), advice (noun): ad-vise / ad-vice
• A syllable consists of: 1. Nucleus, 2. Onset, 3. Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W) = P(t o m ae t o | Word is "tomato") = product of arc probabilities

[Pronunciation automaton for the word "Tomato": s1 --t/1.0--> s2 --o/1.0--> s3 --m/1.0--> s4; then s4 --ae/0.73--> s5 or s4 --aa/0.27--> s5; then s5 --t/1.0--> s6 --o/1.0--> s7 --1.0--> end]

21 July 2014Pushpak Bhattacharyya Intro

POS 44
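Reading the arc probabilities off this automaton (a worked instance of the product rule on the slide): the /ae/ pronunciation gets 1.0 x 1.0 x 1.0 x 0.73 x 1.0 x 1.0 x 1.0 = 0.73, and the competing /aa/ pronunciation gets the remaining 0.27.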

Foundational question

Generative vs Discriminative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

• Entity A and Entity B: Match(A, B)
  - Two entities match iff their parts match: Match(Parts(A), Parts(B))
  - Two entities match iff their properties match: Match(Properties(A), Properties(B))
• Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books, Journals, Proceedings

Main Text(s):
- Natural Language Understanding, James Allen
- Speech and NLP, Jurafsky and Martin
- Foundations of Statistical NLP, Manning and Schutze

Other References:
- Statistical NLP, Charniak

Journals: Computational Linguistics, Natural Language Engineering, AI, AI Magazine, IEEE SMC

Conferences: ACL, EACL, COLING, MT Summit, EMNLP, IJCNLP, HLT, ICON, SIGIR, WWW, ICML, ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied Disciplines
Philosophy: Semantics, meaning of "meaning", logic (syllogism)
Linguistics: Study of syntax, lexicon, lexical semantics etc.
Probability and Statistics: Corpus linguistics, testing of hypotheses, system evaluation
Cognitive Science: Computational models of language processing, language acquisition
Psychology: Behavioristic insights into language processing, psychological models
Brain Science: Language processing areas in brain
Physics: Information theory, entropy, random fields
Computer Sc. & Engg.: Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

inflection, declension, derivation etc.), rule-based vs statistical NLU comparison with POS tagging as case study, Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summary
• Both Linguistics and Computation needed
• Linguistics is the eye, Computation the body
• Phenomenon -> Formalization -> Technique -> Experimentation -> Evaluation -> Hypothesis Testing
  is what has accorded to NLP the prestige it commands today
• Natural-science-like approach
• Neither theory building nor data-driven pattern finding can be ignored
21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

[NLP Trinity diagram: Problem axis (Morph Analysis, Part of Speech Tagging, Parsing, Semantics), Algorithm axis (HMM, MEMM, CRF), Language axis (Hindi, Marathi, English, French).]

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

"_" The_DT mechanisms_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, "_" said_VBD Mr._NNP Benton_NNP ._.

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_J batsman_N Aravinda_De_Silva_N is_F a_F man_N of_F few_J words_N was_F very_R much_R evident_J on_F Wednesday_N when_F the_F legendary_J batsman_N ,_F who_F has_V always_R let_V his_N bat_N talk_V ,_F struggled_V to_F answer_V a_F barrage_N of_F questions_N at_F a_F function_N to_F promote_V the_F cricket_N league_N in_F the_F city_N ._F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation:
• That_F/N/J ('that' can be a complementizer (put under F), demonstrative (put under J) or pronoun (put under N))
• former_J; Sri_N/J Lanka_N/J ('Sri Lanka' together qualify the skipper)
• skipper_N/V ('skipper' can be a verb too)
• and_F; ace_J/N ('ace' can be both J and N: "Nadal served an ace")
• batsman_N/J ('batsman' can be J as it qualifies Aravinda De Silva)
• Aravinda_N De_N Silva_N is_F a_F
• man_N/V ('man' can be a verb too, as in 'man the boat')
• of_F few_J
• words_N/V ('words' can be a verb too, as in 'he words his speeches beautifully')

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of "That"

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation (contd.):
• was_F very_R much_R evident_J on_F Wednesday_N
• when_F/N ('when' can be a relative pronoun (put under N), as in 'I know the time when he comes')
• the_F legendary_J batsman_N
• who_F/N has_V always_R let_V his_N bat_N/V talk_V/N
• struggle_V/N, answer_V/N, barrage_N/V, question_N/V, function_N/V, promote_V, cricket_N, league_N, city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (1/2)
Best tag sequence
= T* = argmax P(T|W)
= argmax P(T) P(W|T)        (by Bayes' Theorem)

P(T) = P(t0=^, t1, t2, …, tn+1=.)
     = P(t0) P(t1|t0) P(t2|t1,t0) P(t3|t2,t1,t0) … P(tn|tn-1,tn-2,…,t0) P(tn+1|tn,tn-1,…,t0)
     = P(t0) P(t1|t0) P(t2|t1) … P(tn|tn-1) P(tn+1|tn)
     = Π(i=0 to n+1) P(ti|ti-1)        (Bigram Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0…tn+1) P(w1|w0, t0…tn+1) P(w2|w1, w0, t0…tn+1) … P(wn|w0…wn-1, t0…tn+1) P(wn+1|w0…wn, t0…tn+1)

Assumption: A word is determined completely by its tag. This is inspired by speech recognition.

= P(w0|t0) P(w1|t1) … P(wn+1|tn+1)
= Π(i=0 to n+1) P(wi|ti)
= Π(i=1 to n+1) P(wi|ti)        (Lexical Probability Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 66
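Putting the two assumptions together (just combining the previous two slides): T* = argmax over T of Π(i) P(ti|ti-1) · P(wi|ti). This product of bigram (transition) and lexical (emission) probabilities is exactly what the Viterbi search later in these slides maximizes.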

Generative Model

^_^ People_N Jump_V High_R ._.

[Lattice diagram: candidate tags for each word (N, V, A, …) connected by bigram probabilities; lexical probabilities connect each tag to its word.]

This model is called a generative model. Here words are observed from tags as states. This is similar to an HMM.

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi - Unigram, Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68
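A small sketch (not from the slides) of how the last two items can be computed from tagger output; the toy gold/predicted tags below are invented:

# Sketch: per-POS accuracy and confusion matrix from (gold, predicted) tag pairs.
from collections import Counter, defaultdict

def evaluate(gold_tags, pred_tags):
    confusion = defaultdict(Counter)          # confusion[gold][pred] = count
    for g, p in zip(gold_tags, pred_tags):
        confusion[g][p] += 1
    per_pos = {g: row[g] / sum(row.values()) for g, row in confusion.items()}
    return confusion, per_pos

conf, acc = evaluate(["N", "V", "N", "J"], ["N", "V", "J", "J"])   # toy data
print(acc)   # {'N': 0.5, 'V': 1.0, 'J': 1.0}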

[Bar chart residue: per-POS accuracy under the bigram assumption, one bar per tag (AJ0, AJ0-NN1, AJ0-VVG, AJC, AT0, AV0-AJ0, AVP-PRP, …, VVZ-NN2); y-axis runs from 0 to 1.2.]

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

Gold tag (rows) vs assigned tag (columns):

          AJ0   AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN  AJC  AJS  AT0   AV0   AV0-AJ0  AVP
AJ0       2899  20       32       1        3        3        0    0    18    35    27       1
AJ0-AV0   31    18       2        0        0        0        0    0    0     1     15       0
AJ0-NN1   161   0        116      0        0        0        0    0    0     0     1        0
AJ0-VVD   7     0        0        0        0        0        0    0    0     0     0        0
AJ0-VVG   8     0        0        0        2        0        0    0    1     0     0        0
AJ0-VVN   8     0        0        3        0        2        0    0    1     0     0        0
AJC       2     0        0        0        0        0        69   0    0     11    0        0
AJS       6     0        0        0        0        0        0    38   0     2     0        0
AT0       192   0        0        0        0        0        0    0    7000  13    0        0
AV0       120   8        2        0        0        0        15   2    24    2444  29       11
AV0-AJ0   10    7        0        0        0        0        0    0    0     16    33       0
AVP       24    0        0        0        0        0        0    0    1     11    0        737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

[NLP Trinity diagram (repeated): Problem axis (Morph Analysis, Part of Speech Tagging, Parsing, Semantics), Algorithm axis (HMM, MEMM, CRF), Language axis (Hindi, Marathi, English, French).]

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1: # of Red = 30, # of Green = 50, # of Blue = 20
Urn 2: # of Red = 10, # of Green = 40, # of Blue = 50
Urn 3: # of Red = 60, # of Green = 10, # of Blue = 30

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

Given:

Transition probability table:
      U1    U2    U3
U1    0.1   0.4   0.5
U2    0.6   0.2   0.2
U3    0.3   0.4   0.3

Emission probability table:
      R     G     B
U1    0.3   0.5   0.2
U2    0.1   0.4   0.5
U3    0.6   0.1   0.3

Observation: R R G G B R G R
State sequence: not so easily computable.

21 July 2014Pushpak Bhattacharyya Intro

POS 73


Diagrammatic representation (12)

[State diagram: U1, U2, U3 with the transition probabilities above on the arcs (e.g. U1->U2 = 0.4, U2->U1 = 0.6, U3->U3 = 0.3) and the emission probabilities written at each urn (e.g. U3: R 0.6, G 0.1, B 0.3).]

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

[Same diagram with emission and transition probabilities multiplied onto each arc, e.g. the U1->U1 arc carries R: 0.3x0.1 = 0.03, G: 0.5x0.1 = 0.05, B: 0.2x0.1 = 0.02.]

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM:
1. Given the observation sequence, find the possible state sequences - Viterbi
2. Given the observation sequence, find its probability - forward/backward algorithm
3. Given the observation sequence, find the HMM parameters - Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi
• The "start" and "end" are important in a sequence.
• Subtrees get eliminated due to the Markov Assumption.

POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), . (end) [start & end states]

Lexicon: people: N, V;  laugh: N, V

Corpora for Training:
^ w11_t11 w12_t12 w13_t13 …… w1k_1_t1k_1 .
^ w21_t21 w22_t22 w23_t23 …… w2k_2_t2k_2 .
^ wn1_tn1 wn2_tn2 wn3_tn3 …… wnk_n_tnk_n .

Inference

[Partial sequence graph: ^ expands to N and V for 'people'; each of those expands to N and V for 'laugh'.]

Transition probability table (this table will change from language to language due to language divergences):
      ^    N     V     O     .
^     0    0.6   0.2   0.2   0
N     0    0.1   0.4   0.3   0.2
V     0    0.3   0.1   0.3   0.3
O     0    0.3   0.2   0.3   0.2
.     1    0     0     0     0

Lexical Probability Table (size of this table = #POS tags in tagset x vocabulary size; vocabulary size = #unique words in the corpus):
      ε    people    laugh     …
^     1    0         0         0
N     0    1x10^-3   1x10^-5
V     0    1x10^-6   1x10^-3
O     0    0         1x10^-9
.     1    0         0         0

Inference for a new sentence: ^ people laugh .
p(^ N N | ^ people laugh) = (0.6 x 1.0) x (0.1 x 1x10^-3) x (0.2 x 1x10^-5)

Computational Complexity

If we have to get the probability of each sequence and then find the maximum among them, we would run into an exponential number of computations.
If |s| = #states (tags + ^ + .) and |o| = length of sentence (#words + ^ + .),
then #sequences = |s|^(|o|-2).
But a large number of partial computations can be reused using Dynamic Programming.

Dynamic Programming

[Trellis: ^ expands on ε to N1, V2, O3; each of those expands again for 'people' and for 'laugh'.]

From ^ on ε:  score(N1) = 0.6 x 1.0 = 0.6;  score(V2) = 0.2;  score(O3) = 0.2

Extending N1 after emitting 'people' (P(people|N) = 10^-3):
  to N: 0.6 x 0.1 x 10^-3 = 6 x 10^-5
  to V: 0.6 x 0.4 x 10^-3 = 2.4 x 10^-4
  to O: 0.6 x 0.3 x 10^-3 = 1.8 x 10^-4
  to .: 0.6 x 0.2 x 10^-3 = 1.2 x 10^-4

No need to expand N4 and N5, because they will never be a part of the winning sequence.

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.
Now complexity reduces from |s|^|o| to |s|·|o|.
Here we followed the Markov assumption of order 1.
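For this tiny sentence the exponential enumeration is still small enough to write out; a sketch using the transition and lexical tables given above (the code structure itself is not from the slides):

# Brute-force enumeration of all tag sequences for "^ people laugh ." (sketch).
from itertools import product

trans = {"^": {"N": 0.6, "V": 0.2, "O": 0.2},
         "N": {"N": 0.1, "V": 0.4, "O": 0.3, ".": 0.2},
         "V": {"N": 0.3, "V": 0.1, "O": 0.3, ".": 0.3},
         "O": {"N": 0.3, "V": 0.2, "O": 0.3, ".": 0.2}}
lex = {"N": {"people": 1e-3, "laugh": 1e-5},
       "V": {"people": 1e-6, "laugh": 1e-3},
       "O": {"people": 0.0,  "laugh": 1e-9}}

words = ["people", "laugh"]

def score(tags):
    p, prev = 1.0, "^"
    for w, t in zip(words, tags):
        p *= trans[prev][t] * lex[t].get(w, 0.0)
        prev = t
    return p * trans[prev]["."]              # close the sequence with '.'

best = max(product("NVO", repeat=len(words)), key=score)
print(best, score(best))    # ('N', 'V') wins for this toy example

Dynamic programming (Viterbi, below) gets the same winner without enumerating all |s|^|o| sequences.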

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state. Keep advancing sequences that are "maximum" amongst all those ending in the same state.

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm

Tree for the sentence "^ People laugh .":
[Tree: from ^ on ε, branches N (0.6), V (0.2), O (0.2); after 'People',
 the N branch expands to N (0.06x10^-3), V (0.24x10^-3), O (0.18x10^-3),
 the V branch to N (0.06x10^-6), V (0.02x10^-6), O (0.06x10^-6),
 and the O branch to (0), (0), (0).]

Claim: We do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass: Will a word always be given the same tag? No. Consider the example:
^ people the city with soldiers . (i.e. 'populate')
^ quickly people the city .
In the first sentence, "people" is most likely to be tagged as noun, whereas in the second the probability mass will shift and "people" will be tagged as verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon
• Long tail phenomenon: probability is very low, but not zero, over a large observed sequence.
• Language phenomenon: "people", which is predominantly tagged as Noun, displays a long tail phenomenon; "laugh" is predominantly tagged as Verb.

POS 89

Viterbi phenomenon (Markov process)

N1 (6x10^-5)              N2 (6x10^-8)
children: N V O           children: N V O
(word: LAUGH)
Next step: all the probabilities will be multiplied by identical probabilities (lexical and transition). So the children of N2 will have probability less than the children of N1.

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B) = P(B|A) if P(A) = P(B).
P(A|B) can mean: Causality (B causes A), or Sequentiality (A follows B).

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = {U1, U2, U3}, V = {R, G, B}.
For observation O = o1, …, on and state sequence Q = q1, …, qn:

A (transition probabilities) =
      U1    U2    U3
U1    0.1   0.4   0.5
U2    0.6   0.2   0.2
U3    0.3   0.4   0.3

B (emission probabilities) =
      R     G     B
U1    0.3   0.5   0.2
U2    0.1   0.4   0.5
U3    0.6   0.1   0.3

π is the initial state distribution: π_i = P(q1 = U_i)

92
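The same specification written out as data (an illustrative sketch; the slides give A and B but do not fix numeric values for π, so the uniform value below is only a placeholder assumption):

# Urn HMM parameters as plain dictionaries (sketch).
A = {"U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},      # transition probabilities
     "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
     "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3}}
B = {"U1": {"R": 0.3, "G": 0.5, "B": 0.2},         # emission probabilities
     "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
     "U3": {"R": 0.6, "G": 0.1, "B": 0.3}}
pi = {"U1": 1/3, "U2": 1/3, "U3": 1/3}             # placeholder: the slides leave pi unspecified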

Observations and states
       O1 O2 O3 O4 O5 O6 O7 O8
Obs:   R  R  G  G  B  R  G  R
State: S1 S2 S3 S4 S5 S6 S7 S8
Si = U1, U2 or U3; a particular state.
S: state sequence; O: observation sequence.
S* = "best" possible state (urn) sequence.
Goal: Maximize P(S|O) by choosing the "best" S.

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O), where S is the State Sequence and O is the Observation Sequence:
S* = argmax_S P(S|O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

P(S|O) = P(S1-8 | O1-8)
       = P(S1|O) P(S2|S1, O) P(S3|S2, S1, O) … P(S8|S7 … S1, O)

By Markov Assumption (a state depends only on the previous state):
P(S|O) = P(S1|O) P(S2|S1, O) P(S3|S2, O) … P(S8|S7, O)

       O1 O2 O3 O4 O5 O6 O7 O8
Obs:   R  R  G  G  B  R  G  R
State: S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) = P(A) P(B|A) / P(B)
P(A): Prior
P(B|A): Likelihood

argmax_S P(S|O) = argmax_S P(S) P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

P(S) = P(S1-8)
     = P(S1) P(S2|S1) P(S3|S2, S1) P(S4|S3, S2, S1) … P(S8|S7, S6, …, S1)

By Markov Assumption (k=1):
P(S) = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) … P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

P(O1-8 | S1-8) = P(O1|S1-8) P(O2|O1, S1-8) P(O3|O1-2, S1-8) … P(O8|O1-7, S1-8)

Assumption: the ball drawn depends only on the urn chosen.
P(O1-8 | S1-8) = P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)

P(S|O) ∝ P(S) P(O|S)
P(S) P(O|S) = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) … P(S8|S7) · P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

       O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:   ε  R  R  G  G  B  R  G  R
State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

[Chain diagram: S0 --ε--> S1 --R--> S2 --R--> S3 --G--> S4 --G--> S5 --B--> S6 --R--> S7 --G--> S8 --R--> S9]

       O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:   ε  R  R  G  G  B  R  G  R
State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

P(Ok|Sk) P(Sk+1|Sk) = P(Sk --Ok--> Sk+1)

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

[Probabilistic FSM with two states S1 and S2; each arc is labelled (symbol, probability): (a1, 0.3), (a2, 0.4), (a1, 0.2), (a2, 0.3), (a1, 0.1), (a2, 0.2), (a1, 0.3), (a2, 0.2).]

The question here is: "What is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree

Start: S1 (1.0), S2 (0.0)

Reading a1 (arc probabilities S1->S1: 0.1, S1->S2: 0.3, S2->S1: 0.2, S2->S2: 0.3):
  1.0 x 0.1 = 0.1,  1.0 x 0.3 = 0.3,  0.0,  0.0

Reading a2 (arc probabilities S1->S1: 0.2, S1->S2: 0.4, S2->S1: 0.3, S2->S2: 0.2):
  0.1 x 0.2 = 0.02,  0.1 x 0.4 = 0.04,  0.3 x 0.3 = 0.09,  0.3 x 0.2 = 0.06

Choose the winning sequence per state, per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd…

Winners so far: S1 = 0.09, S2 = 0.06.

Reading a1:  S1 candidates: 0.09 x 0.1 = 0.009, 0.06 x 0.2 = 0.012 (winner 0.012);
             S2 candidates: 0.09 x 0.3 = 0.027, 0.06 x 0.3 = 0.018 (winner 0.027).
Reading a2:  S1 candidates: 0.012 x 0.2 = 0.0024, 0.027 x 0.3 = 0.0081 (winner 0.0081);
             S2 candidates: 0.012 x 0.4 = 0.0048, 0.027 x 0.2 = 0.0054 (winner 0.0054).

The problem being addressed by this tree is S* = argmax_S P(S | a1-a2-a1-a2, μ),
where a1-a2-a1-a2 is the output sequence and μ the model or the machine.

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward):
S1 --a1--> S2 --a2--> S1 --a1--> S2 --a2--> S1

Problem statement: Find the best possible sequence
S* = argmax_S P(S | O, μ)
where S = state sequence, O = output sequence, and μ = the machine or model.
μ = {S0, S, A, T}: start symbol, state collection, alphabet set, transitions.
T is defined as P(Si --ak--> Sj) for all i, j, k.

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

Ending state \ latest symbol observed:
      ε     a1                               a2             a1               a2
S1    1.0   (1.0x0.1, 0.0x0.2) = (0.1, 0.0)  (0.02, 0.09)   (0.009, 0.012)   (0.0024, 0.0081)
S2    0.0   (1.0x0.3, 0.0x0.3) = (0.3, 0.0)  (0.04, 0.06)   (0.027, 0.018)   (0.0048, 0.0054)

Note: Every cell records the winning probability ending in that state.
The bold-faced value in each cell is the winning sequence probability ending in that state. Going backward from the final winner (the 0.0081 in the S1 row; its position as the 2nd element of the tuple indicates it was reached from S2), we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm (following James Allen, Natural Language Understanding (2nd edition), Benjamin Cummings (pub.), 1995)

Given:
1. The HMM, which means:
   a. Start State: S1
   b. Alphabet: A = {a1, a2, …, ap}
   c. Set of States: S = {S1, S2, …, Sn}
   d. Transition probability P(Si --ak--> Sj), which is equal to P(Sj, ak | Si)
2. The output string: a1 a2 … aT

To find: The most likely sequence of states C1 C2 … CT which produces the given output sequence, i.e.,
C1 C2 … CT = argmax_C [ P(C | a1, a2, …, aT) ]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structures:
1. An N x T array called SEQSCORE to always maintain the winner sequence (N = #states, T = length of output sequence)
2. Another N x T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For i = 2 to N do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For t = 2 to T do
    For i = 1 to N do
        SEQSCORE(i,t) = Max(j=1..N) [ SEQSCORE(j, t-1) * P(Sj --ak--> Si) ]
        BACKPTR(i,t) = the index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = the i that maximizes SEQSCORE(i, T)
For i from (T-1) down to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1 x T
2. SEQSCORE can be T x 2

Homework: Compare this with A*, Beam Search. (Reason for this comparison: both of them work for finding and recovering sequences.)

21 July 2014Pushpak Bhattacharyya Intro

POS 109
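A Python sketch of the three steps above (the SEQSCORE/BACKPTR layout follows the slides; everything else, including the arc-probability dictionary, is an illustrative choice):

# Sketch of the Viterbi steps: SEQSCORE holds the best score per (state, t),
# BACKPTR the argmax predecessor, and the path is recovered backwards.
def viterbi(output, states, trans_emit):
    # trans_emit[(Sj, a, Si)] = P(Sj --a--> Si): move from Sj to Si emitting symbol a
    N, T = len(states), len(output)
    SEQSCORE = [[0.0] * (T + 1) for _ in range(N)]
    BACKPTR  = [[0]   * (T + 1) for _ in range(N)]
    SEQSCORE[0][0] = 1.0                       # first state is states[0], the start state
    for t in range(1, T + 1):
        a = output[t - 1]
        for i in range(N):
            best_j = max(range(N),
                         key=lambda j: SEQSCORE[j][t - 1] * trans_emit.get((states[j], a, states[i]), 0.0))
            SEQSCORE[i][t] = SEQSCORE[best_j][t - 1] * trans_emit.get((states[best_j], a, states[i]), 0.0)
            BACKPTR[i][t] = best_j
    # sequence identification
    C = [0] * (T + 1)
    C[T] = max(range(N), key=lambda i: SEQSCORE[i][T])
    for t in range(T, 0, -1):
        C[t - 1] = BACKPTR[C[t]][t]
    return [states[i] for i in C[1:]], SEQSCORE[C[T]][T]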

Viterbi Algorithm for the Urn problem (first two symbols)

[Viterbi trellis for the urn problem, first two symbols (ε, then R): S0 branches to U1, U2, U3; each of these expands again to U1, U2, U3 after reading R, and every node carries the accumulated best score computed from the tables above.]

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

t sound of t in tag d sound of d in dog D sound of the

21 July 2014Pushpak Bhattacharyya Intro

POS 42

Syllables

Advise (verb) Advice (noun)

ad viceadvise

bull Consists of1 Nucleus2 Onset3 Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W)= P(t o m ae t o |Word is ldquotomatordquo) = Product of arc probabilities

t

s4

o m o

ae

t

aa

end

s1 s2 s3s5

s6 s7

10 10 10 1010

10

073

027

Word

Pronunciation Automaton

Tomato

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs Discrimnative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

bull Entity A and Entity BndashMatch(AB)

ndashTwo entities match iff their parts matchbull Match(Parts(A) Parts(B))

ndashTwo entities match iff their properties matchbull Match(Properties(A) Properties(B))

bull Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding James Allan Speech and NLP Jurafsky and Martin Foundations of Statistical NLP Manning and Schutze

Other References Statistical NLP Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied DisciplinesPhilosophy Semantics Meaning of ldquomeaningrdquo Logic

(syllogism)Linguistics Study of Syntax Lexicon Lexical Semantics etc

Probability and Statistics Corpus Linguistics Testing of Hypotheses System Evaluation

Cognitive Science Computational Models of Language Processing Language Acquisition

Psychology Behavioristic insights into Language Processing Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory Entropy Random Fields

Computer Sc amp Engg Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

infection declension derivation etc ) Rule based Vs Statistical NLU comparison with POS tagging as case study Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

ldquo_ldquo The_DT mechanisms_NNS that_WDTmake_VBP traditional_JJ hardware_NNare_VBP really_RB being_VBGobsoleted_VBN by_IN microprocessor-based_JJ machines_NNS _ rdquo_rdquo said_VBDMr_NNP Benton_NNP _

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity
Let |S| = number of states and |O| = observation length excluding the '^' and '.' markers.
Stage 1 of the trellis: |S| multiplications.
Stage 2 of the trellis: |S| nodes, each node needs computation over |S| arcs;
each arc = 1 multiplication, accumulated F = 1 more multiplication;
total 2|S|^2 multiplications.
The same holds for each stage before reading '.'.
At the final stage ('.'): 2|S| multiplications.
Therefore total multiplications = |S| + 2|S|^2 (|O| - 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of F values multiplied by P(Sp → Sq).
3. Complexity = |S| + 2|S|^2 (|O| - 1) + 2|S|
             = 2|S|^2 |O| - 2|S|^2 + 3|S|
             = O(|S|^2 |O|)

i.e. linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward Probability: (a) derive the Backward Algorithm, (b) compute its complexity.

2. Express P(O) in terms of both Forward and Backward probability.

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Syllables

Advise (verb), Advice (noun)

ad.vise / ad.vice

• A syllable consists of: 1. Nucleus 2. Onset 3. Coda

21 July 2014Pushpak Bhattacharyya Intro

POS 43

Pronunciation Dictionary

P(SS|W) = P(t o m ae t o | word is "tomato") = product of arc probabilities

Word: Tomato
Pronunciation automaton: s1 --t (1.0)--> s2 --o (1.0)--> s3 --m (1.0)--> s4,
then s4 --ae (0.73)--> s5 or s4 --aa (0.27)--> s5,
then s5 --t (1.0)--> s6 --o (1.0)--> s7 (end).

21 July 2014Pushpak Bhattacharyya Intro

POS 44
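A small sketch of the arc-product computation, assuming the automaton structure just described (state names s1–s7 and the 0.73/0.27 ae/aa branch are read off the figure; everything else is 1.0).

arc_prob = {
    ('s1', 't'): (1.0, 's2'), ('s2', 'o'): (1.0, 's3'), ('s3', 'm'): (1.0, 's4'),
    ('s4', 'ae'): (0.73, 's5'), ('s4', 'aa'): (0.27, 's5'),
    ('s5', 't'): (1.0, 's6'), ('s6', 'o'): (1.0, 's7'),
}

def pronunciation_probability(phones, start='s1', end='s7'):
    """P(phone sequence | word) = product of arc probabilities along the path."""
    state, p = start, 1.0
    for ph in phones:
        if (state, ph) not in arc_prob:
            return 0.0                       # no such arc: impossible pronunciation
        prob, state = arc_prob[(state, ph)]
        p *= prob
    return p if state == end else 0.0

print(pronunciation_probability(['t', 'o', 'm', 'ae', 't', 'o']))   # 0.73
print(pronunciation_probability(['t', 'o', 'm', 'aa', 't', 'o']))   # 0.27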

Foundational question

Generative vs Discriminative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

• Entity A and Entity B – Match(A, B)
  – Two entities match iff their parts match: Match(Parts(A), Parts(B))
  – Two entities match iff their properties match: Match(Properties(A), Properties(B))

• Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books, Journals, Proceedings
Main Text(s):
  Natural Language Understanding, James Allen
  Speech and NLP, Jurafsky and Martin
  Foundations of Statistical NLP, Manning and Schutze

Other References:
  Statistical NLP, Charniak

Journals: Computational Linguistics, Natural Language Engineering, AI, AI Magazine, IEEE SMC

Conferences: ACL, EACL, COLING, MT Summit, EMNLP, IJCNLP, HLT, ICON, SIGIR, WWW, ICML, ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied Disciplines
Philosophy: Semantics, meaning of "meaning", Logic (syllogism)
Linguistics: Study of Syntax, Lexicon, Lexical Semantics etc.
Probability and Statistics: Corpus Linguistics, Testing of Hypotheses, System Evaluation
Cognitive Science: Computational Models of Language Processing, Language Acquisition
Psychology: Behavioristic insights into Language Processing, Psychological Models
Brain Science: Language Processing Areas in the Brain
Physics: Information Theory, Entropy, Random Fields
Computer Sc. & Engg.: Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1: Introduction. NLP as playground for rule based and statistical techniques.
  Before break: complete NLP architecture, ambiguity, start of POS tagging.
  After break: NLTK (open source Python based framework of comprehensive NLP tools), POS tagging assignment.

Day-2: Shallow parsing.
  Before break: morph analysis and synthesis (segmentation, inflection, declension, derivation etc.), rule based vs. statistical NLU comparison with POS tagging as case study, Hidden Markov Model and Viterbi algorithm.
  After break: POS tagging assignment continued.

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summary
• Both Linguistics and Computation are needed
• Linguistics is the eye, Computation the body
• Phenomenon -> Formalization -> Technique -> Experimentation -> Evaluation -> Hypothesis Testing
• is what has accorded to NLP the prestige it commands today
• Natural-Science-like approach
• Neither Theory Building nor Data Driven Pattern finding can be ignored
21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

(NLP Trinity diagram — Problem: Morph Analysis, Part of Speech Tagging, Parsing, Semantics; Algorithm: HMM, MEMM, CRF; Language: Hindi, Marathi, English, French)

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

"_" The_DT mechanisms_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, "_" said_VBD Mr._NNP Benton_NNP ._.

21 July 2014Pushpak Bhattacharyya Intro

POS 57
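For a quick hands-on feel (NLTK is the toolkit used in the course assignments), a Penn-Treebank-style tagging like the one above can be reproduced with NLTK's default tagger. This is a sketch; it assumes the punkt and averaged_perceptron_tagger resources have been downloaded, and the output tags may differ slightly from the slide.

import nltk

# one-time downloads: tokenizer model and the default English POS tagger
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = ("The mechanisms that make traditional hardware are really being "
            "obsoleted by microprocessor-based machines, said Mr. Benton.")
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('mechanisms', 'NNS'), ('that', 'WDT'), ('make', 'VBP'), ...]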

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate the complexity of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_J batsman_N Aravinda_De_Silva_N is_F a_F man_N of_F few_J words_N was_F very_R much_R evident_J on_F Wednesday_N when_F the_F legendary_J batsman_N ,_F who_F has_V always_R let_V his_N bat_N talk_V ,_F struggled_V to_F answer_V a_F barrage_N of_F questions_N at_F a_F function_N to_F promote_V the_F cricket_N league_N in_F the_F city_N ._F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation
That_F/N/J ('that' can be a complementizer (put under 'F'), a demonstrative (put under 'J') or a pronoun (put under 'N'))
former_J
Sri_N/J Lanka_N/J ('Sri Lanka' together qualify the skipper)
skipper_N/V ('skipper' can be a verb too)
and_F
ace_J/N ('ace' can be both J and N: "Nadal served an ace")
batsman_N/J ('batsman' can be J as it qualifies Aravinda De Silva)
Aravinda_N De_N Silva_N is_F a_F
man_N/V ('man' can be a verb too, as in 'man the boat')
of_F few_J
words_N/V ('words' can be a verb too, as in 'he words his speeches beautifully')

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation (contd.)
was_F very_R much_R evident_J on_F Wednesday_N
when_F/N ('when' can be a relative pronoun (put under 'N'), as in 'I know the time when he comes')
the_F legendary_J batsman_N who_F/N has_V always_R let_V his_N bat_N/V talk_V/N struggle_V/N answer_V/N barrage_N/V question_N/V function_N/V promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (1/2)
Best tag sequence = T* = argmax P(T|W) = argmax P(T) P(W|T)   (by Bayes' Theorem)

P(T) = P(t0=^, t1, t2, …, tn+1=.)
     = P(t0) P(t1|t0) P(t2|t1,t0) P(t3|t2,t1,t0) … P(tn|tn-1,tn-2,…,t0) P(tn+1|tn,tn-1,…,t0)
     = P(t0) P(t1|t0) P(t2|t1) … P(tn|tn-1) P(tn+1|tn)
     = Π i=0..n+1 P(ti|ti-1)   (Bigram Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0–tn+1) P(w1|w0,t0–tn+1) P(w2|w1,w0,t0–tn+1) … P(wn|w0–wn-1,t0–tn+1) P(wn+1|w0–wn,t0–tn+1)

Assumption: a word is determined completely by its tag. This is inspired by speech recognition.

       = P(w0|t0) P(w1|t1) … P(wn+1|tn+1)
       = Π i=0..n+1 P(wi|ti)   (Lexical Probability Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 66
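Putting the two assumptions together, a tag sequence can be scored as P(T) P(W|T). The sketch below uses made-up toy probabilities (they are not from any corpus) purely to show the shape of the computation; unseen events are floored at a tiny constant for illustration.

import math

bigram = {            # P(t_i | t_{i-1}) -- toy values
    ('^', 'N'): 0.6, ('^', 'V'): 0.2,
    ('N', 'V'): 0.4, ('N', 'N'): 0.1, ('N', '.'): 0.2,
    ('V', 'N'): 0.3, ('V', '.'): 0.3,
}
lexical = {           # P(w_i | t_i) -- toy values
    ('people', 'N'): 1e-3, ('people', 'V'): 1e-6,
    ('laugh', 'N'): 1e-5, ('laugh', 'V'): 1e-3,
    ('.', '.'): 1.0,
}

def score(words, tags):
    """log P(T) + log P(W|T) under the bigram and lexical assumptions (t0 = '^')."""
    logp, prev = 0.0, '^'
    for w, t in zip(words, tags):
        logp += math.log(bigram.get((prev, t), 1e-12))
        logp += math.log(lexical.get((w, t), 1e-12))
        prev = t
    return logp

print(score(['people', 'laugh', '.'], ['N', 'V', '.']))
print(score(['people', 'laugh', '.'], ['V', 'N', '.']))   # lower score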

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi – Unigram, Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68
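The per-POS accuracy and confusion matrix steps take only a few lines. A sketch, assuming parallel lists of gold and predicted tags for the same tokens (the tiny lists below are illustrative, not real evaluation data):

from collections import Counter, defaultdict

def per_pos_accuracy_and_confusion(gold, predicted):
    """gold, predicted: parallel lists of tags for the same tokens."""
    confusion = defaultdict(Counter)       # confusion[gold_tag][predicted_tag] -> count
    for g, p in zip(gold, predicted):
        confusion[g][p] += 1
    per_pos = {g: row[g] / sum(row.values()) for g, row in confusion.items()}
    return per_pos, confusion

gold      = ['DT', 'NNS', 'WDT', 'VBP', 'JJ', 'NN', 'VBP']
predicted = ['DT', 'NNS', 'IN',  'VBP', 'JJ', 'NN', 'VBD']
acc, conf = per_pos_accuracy_and_confusion(gold, predicted)
print(acc)           # e.g. {'DT': 1.0, 'WDT': 0.0, 'VBP': 0.5, ...}
print(conf['VBP'])   # Counter({'VBP': 1, 'VBD': 1})
# For five-fold evaluation, the same computation is repeated per fold and averaged.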

(Bar chart: "Per POS Accuracy for Bigram Assumption" — per-tag accuracy on a 0–1.2 scale for BNC tags from AJ0 through VVZ-NN2, including ambiguity tags such as AJ0-NN1, AJ0-VVG, AV0-AJ0, CJT-DT0, NN1-VVG, VVD-AJ0 and VVN-VVD.)

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

           AJ0  AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN   AJC   AJS   AT0    AV0  AV0-AJ0   AVP
AJ0       2899       20       32        1        3        3     0     0     18     35       27     1
AJ0-AV0     31       18        2        0        0        0     0     0      0      1       15     0
AJ0-NN1    161        0      116        0        0        0     0     0      0      0        1     0
AJ0-VVD      7        0        0        0        0        0     0     0      0      0        0     0
AJ0-VVG      8        0        0        0        2        0     0     0      1      0        0     0
AJ0-VVN      8        0        0        3        0        2     0     0      1      0        0     0
AJC          2        0        0        0        0        0    69     0      0     11        0     0
AJS          6        0        0        0        0        0     0    38      0      2        0     0
AT0        192        0        0        0        0        0     0     0   7000     13        0     0
AV0        120        8        2        0        0        0    15     2     24   2444       29    11
AV0-AJ0     10        7        0        0        0        0     0     0      0     16       33     0
AVP         24        0        0        0        0        0     0     0      1     11        0   737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

(NLP Trinity diagram with HMM highlighted — Problem: Morph Analysis, Part of Speech Tagging, Parsing, Semantics; Algorithm: HMM, MEMM, CRF; Language: Hindi, Marathi, English, French)

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1: # of Red = 30, # of Green = 50, # of Blue = 20
Urn 2: # of Red = 10, # of Green = 40, # of Blue = 50
Urn 3: # of Red = 60, # of Green = 10, # of Blue = 30

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

Given transition probabilities:
         U1    U2    U3
U1      0.1   0.4   0.5
U2      0.6   0.2   0.2
U3      0.3   0.4   0.3

and emission probabilities:
          R     G     B
U1      0.3   0.5   0.2
U2      0.1   0.4   0.5
U3      0.6   0.1   0.3

Observation: RRGGBRGR
State Sequence: ?? (not so easily computable)

21 July 2014Pushpak Bhattacharyya Intro

POS 73

(Transition probability table and emission probability table)
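The urn HMM's parameters can be written down directly as arrays. A and B below are the tables above; the initial distribution π is not given numerically on the slide, so a uniform start is assumed here purely for illustration.

import numpy as np

A = np.array([[0.1, 0.4, 0.5],
              [0.6, 0.2, 0.2],
              [0.3, 0.4, 0.3]])        # transition P(next urn | urn), rows/cols U1, U2, U3
B = np.array([[0.3, 0.5, 0.2],
              [0.1, 0.4, 0.5],
              [0.6, 0.1, 0.3]])        # emission P(colour | urn), columns R, G, B
pi = np.array([1/3, 1/3, 1/3])         # assumed uniform initial distribution

def joint_probability(state_idx, obs_idx):
    """P(S, O) = pi[s1]*B[s1,o1] * A[s1,s2]*B[s2,o2] * ... for one state sequence."""
    p = pi[state_idx[0]] * B[state_idx[0], obs_idx[0]]
    for prev, s, o in zip(state_idx, state_idx[1:], obs_idx[1:]):
        p *= A[prev, s] * B[s, o]
    return p

obs_idx = ['RGB'.index(c) for c in "RRGGBRGR"]
print(joint_probability([0, 0, 1, 1, 2, 2, 0, 2], obs_idx))   # one candidate urn sequence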

Diagrammatic representation (12)

(State-transition diagram over U1, U2, U3: arcs carry the transition probabilities from the table above, and each state is annotated with its emission probabilities, e.g. U3: R 0.6, G 0.1, B 0.3.)

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

(The same diagram with each arc annotated by the combined transition x emission probabilities for R, G and B, e.g. R: 0.24, G: 0.04, B: 0.12 on one arc.)

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM
1. Given the observation sequence, find the possible state sequences – Viterbi
2. Given the observation sequence, find its probability – forward/backward algorithm
3. Given the observation sequence, find the HMM parameters – Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi
The "start" and "end" are important in a sequence.
Subtrees get eliminated due to the Markov Assumption.

POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), . (end) [start & end states]

Illustration of Viterbi: Lexicon
people: N, V
laugh: N, V

Corpora for Training:
^ w11_t11 w12_t12 w13_t13 …… w1k_1_t1k_1 .
^ w21_t21 w22_t22 w23_t23 …… w2k_2_t2k_2 .
^ wn1_tn1 wn2_tn2 wn3_tn3 …… wnk_n_tnk_n .

Inference: partial sequence graph, e.g. ^ -> N -> N and ^ -> N -> V

Transition probability table (this table will change from language to language due to language divergences):
        ^     N     V     O     .
^       0    0.6   0.2   0.2    0
N       0    0.1   0.4   0.3   0.2
V       0    0.3   0.1   0.3   0.3
O       0    0.3   0.2   0.3   0.2
.       1     0     0     0     0

Lexical Probability Table (size of this table = # POS tags in tagset x vocabulary size, where vocabulary size = # unique words in the corpus):
        ε       people     laugh      …
^       1         0          0
N       0      1x10^-3    1x10^-5
V       0      1x10^-6    1x10^-3
O       0         0        1x10^-9
.       1         0          0

Inference for a new sentence: ^ people laugh .

P(^ N N . | ^ people laugh .)
  = (0.6 x 1) x (0.1 x 1x10^-3) x (0.2 x 1x10^-5)
(each bracket is one arc of the partial sequence graph: transition probability x lexical probability)

Computational Complexity

If we have to compute the probability of each sequence and then find the maximum among them, we run into an exponential number of computations.

If |s| = # states (tags + ^ + .) and |o| = length of the sentence (# words + ^ + .),
then # sequences = |s|^(|o|-2).

But a large number of partial computations can be reused using Dynamic Programming.

Dynamic Programming
(Tree rooted at ^, expanding to N1, V2, O3 on ε, and from each of those to N, V, O, . on "people", and so on for "laugh".)

^ --ε--> N1: 0.6 x 1.0 = 0.6    (similarly V2: 0.2, O3: 0.2)

From N1, reading "people":
  to N: 0.6 x 0.1 x 10^-3 = 6 x 10^-5
  to V: 0.6 x 0.4 x 10^-3 = 2.4 x 10^-4
  to O: 0.6 x 0.3 x 10^-3 = 1.8 x 10^-4
  to . : 0.6 x 0.2 x 10^-3 = 1.2 x 10^-4

No need to expand N4 and N5 because they will never be part of the winning sequence.

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.
Now complexity reduces from |s|^|o| to |s|·|o|.
Here we followed the Markov assumption of order 1.
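As a concrete contrast with the dynamic-programming reuse just described, the exponential alternative can be written out directly: score every tag sequence for "^ people laugh ." with the transition and lexical tables above and keep the maximum. This is only an illustrative sketch; the Viterbi implementation later in this section obtains the same winner with far less work.

from itertools import product

trans = {   # P(next tag | current tag), from the transition table above
    '^': {'N': 0.6, 'V': 0.2, 'O': 0.2, '.': 0.0},
    'N': {'N': 0.1, 'V': 0.4, 'O': 0.3, '.': 0.2},
    'V': {'N': 0.3, 'V': 0.1, 'O': 0.3, '.': 0.3},
    'O': {'N': 0.3, 'V': 0.2, 'O': 0.3, '.': 0.2},
}
lex = {     # P(word | tag), from the lexical probability table above
    'N': {'people': 1e-3, 'laugh': 1e-5},
    'V': {'people': 1e-6, 'laugh': 1e-3},
    'O': {'people': 0.0,  'laugh': 1e-9},
}

words = ['people', 'laugh']
best, best_tags = 0.0, None
for tags in product('NVO', repeat=len(words)):        # |s|^|o| candidate sequences
    p, prev = 1.0, '^'
    for w, t in zip(words, tags):
        p *= trans[prev][t] * lex[t][w]
        prev = t
    p *= trans[prev]['.']                              # close with the end state
    if p > best:
        best, best_tags = p, tags
print(best_tags, best)    # ('N', 'V') wins with these tables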

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state.
Keep advancing sequences that are "maximum" amongst all those ending in the same state.

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm: tree for the sentence "^ People laugh ."

Level 1 (after ε from ^): N (0.6), V (0.2), O (0.2)
Level 2 (after "People"):
  from N: N (0.06x10^-3), V (0.24x10^-3), O (0.18x10^-3)
  from V: N (0.06x10^-6), V (0.02x10^-6), O (0.06x10^-6)
  from O: (0), (0), (0)

Claim: we do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass
Will a word always be given the same tag? No. Consider the examples:
  ^ people the city with soldiers . (i.e. 'populate')
  ^ quickly people the city .
In the first sentence "people" is most likely to be tagged as noun, whereas in the second the probability mass will shift and "people" will be tagged as verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 (6x10^-5)          N2 (6x10^-8)
  N  V  O               N  V  O
Observation: LAUGH
In the next step all the probabilities will be multiplied by identical probabilities (lexical and transition), so the children of N2 will have probability less than the children of N1.

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B) = P(B|A), if P(A) = P(B)

P(A|B) can mean: Causality: B causes A; Sequentiality: A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = {U1, U2, U3}, V = {R, G, B}.
For observation sequence O = o1 … on and state sequence Q = q1 … qn,
π is the initial distribution, πi = P(q1 = Ui).

A (transition probabilities):
         U1    U2    U3
U1      0.1   0.4   0.5
U2      0.6   0.2   0.2
U3      0.3   0.4   0.3

B (emission probabilities):
          R     G     B
U1      0.3   0.5   0.2
U2      0.1   0.4   0.5
U3      0.6   0.1   0.3

92

Observations and states
Obs:    R  R  G  G  B  R  G  R    (O1 … O8)
State: S1 S2 S3 S4 S5 S6 S7 S8

Si ∈ {U1, U2, U3}: a particular state
S: state sequence, O: observation sequence
S* = "best" possible state (urn) sequence
Goal: maximize P(S|O) by choosing the "best" S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O), where S is the State Sequence and O is the Observation Sequence:

S* = argmax_S P(S|O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

P(S1–8 | O1–8)
  = P(S1|O) P(S2|S1,O) P(S3|S2,S1,O) … P(S8|S7,…,S1,O)

By Markov Assumption (a state depends only on the previous state):
  = P(S1|O) P(S2|S1,O) P(S3|S2,O) … P(S8|S7,O)

Obs:    R  R  G  G  B  R  G  R
State: S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) = P(A) P(B|A) / P(B)
P(A): Prior
P(B|A): Likelihood

argmax_S P(S|O) = argmax_S P(S) P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

P(S) = P(S1–8)
     = P(S1) P(S2|S1) P(S3|S2,S1) P(S4|S3,S2,S1) … P(S8|S7,…,S1)

By Markov Assumption (k=1):
P(S) = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) … P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

P(O|S) = P(O1|S1–8) P(O2|O1,S1–8) P(O3|O1–2,S1–8) … P(O8|O1–7,S1–8)

Assumption: the ball drawn depends only on the Urn chosen:

P(O|S) = P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)

P(S|O) ∝ P(S) P(O|S)
       = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) … P(S8|S7)
         x P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)

POS 98

Grouping terms

P(S) P(O|S) = [P(O0|S0) P(S1|S0)] [P(O1|S1) P(S2|S1)] [P(O2|S2) P(S3|S2)] [P(O3|S3) P(S4|S3)]
              [P(O4|S4) P(S5|S4)] [P(O5|S5) P(S6|S5)] [P(O6|S6) P(S7|S6)] [P(O7|S7) P(S8|S7)]
              [P(O8|S8) P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e. P(S9|S8) = 1. O0 is the ε-transition.

Obs:    ε  R  R  G  G  B  R  G  R
State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

(Trellis: S0 –ε→ S1 –R→ S2 –R→ S3 –G→ S4 –G→ S5 –B→ S6 –R→ S7 –G→ S8 –R→ S9)

Obs:    ε  R  R  G  G  B  R  G  R
State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

Notation: P(Ok|Sk) P(Sk+1|Sk) = P(Sk → Sk+1)^Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(Probabilistic FSM over states S1 and S2; each arc is labelled with an output symbol and a probability: (a1, 0.3), (a2, 0.4), (a1, 0.2), (a2, 0.3), (a1, 0.1), (a2, 0.2), (a1, 0.3), (a2, 0.2).)

The question here is: "What is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree
Start: S1 = 1.0, S2 = 0.0
Reading a1: arcs S1→S1 (0.1), S1→S2 (0.3), S2→S1 (0.2), S2→S2 (0.3)
  giving 1.0x0.1 = 0.1, 1.0x0.3 = 0.3, 0.0, 0.0
Reading a2: arcs S1→S1 (0.2), S1→S2 (0.4), S2→S1 (0.3), S2→S2 (0.2)
  giving 0.1x0.2 = 0.02, 0.1x0.4 = 0.04, 0.3x0.3 = 0.09, 0.3x0.2 = 0.06

Choose the winning sequence per state, per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd…
Per-state winners after the second symbol: S1: 0.09, S2: 0.06. Reading a1:
  from S1: 0.09x0.1 = 0.009 (to S1), 0.09x0.3 = 0.027 (to S2)
  from S2: 0.06x0.2 = 0.012 (to S1), 0.06x0.3 = 0.018 (to S2)
Winners: S1: 0.012, S2: 0.027. Reading a2:
  from S1: 0.012x0.2 = 0.0024 (to S1), 0.012x0.4 = 0.0048 (to S2)
  from S2: 0.027x0.3 = 0.0081 (to S1), 0.027x0.2 = 0.0054 (to S2)

The problem being addressed by this tree is S* = argmax_S P(S | a1-a2-a1-a2, μ),
where a1-a2-a1-a2 is the output sequence and μ the model (or the machine).

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward):
S1 --a1--> S2 --a2--> S1 --a1--> S2 --a2--> S1

Problem statement: find the best possible sequence
S* = argmax_S P(S | O, μ)
where S = state sequence, O = output sequence, μ = machine (or model)

μ = (S, S0, A, T): state collection, start symbol, alphabet set, transitions
T is defined as P(Si --ak--> Sj)

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree (rows: ending state; columns: latest symbol observed)

        ε        a1                              a2               a1                 a2
S1     1.0   (1.0x0.1, 0.0x0.2) = (0.1, 0.0)   (0.02, 0.09)    (0.009, 0.012)    (0.0024, 0.0081)
S2     0.0   (1.0x0.3, 0.0x0.3) = (0.3, 0.0)   (0.04, 0.06)    (0.027, 0.018)    (0.0048, 0.0054)

Note: every cell records the winning probability ending in that state (the bold-faced value on the slide). Going backward from the final winner (0.0081, reached via S2), we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm (following James Allen, Natural Language Understanding (2nd edition), Benjamin/Cummings, 1995)

Given:
1. The HMM, which means:
   a. Start State S1
   b. Alphabet A = {a1, a2, …, ap}
   c. Set of States S = {S1, S2, …, Sn}
   d. Transition probability P(Si --ak--> Sj), which is equal to P(Sj, ak | Si)
2. The output string a1 a2 … aT

To find: the most likely sequence of states C1 C2 … CT which produces the given output sequence, i.e.
C1 C2 … CT = argmax_C [P(C | a1 a2 … aT)]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structures:
1. An N x T array called SEQSCORE to maintain the winner sequence always (N = # states, T = length of the output sequence)
2. Another N x T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For i = 2 to N do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For t = 2 to T do
    For i = 1 to N do
        SEQSCORE(i,t) = Max(j=1..N) [SEQSCORE(j,(t-1)) x P(Sj --a_k--> Si)]
        BACKPTR(i,t) = index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = i that maximizes SEQSCORE(i,T)
For i from (T-1) down to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1 x T
2. SEQSCORE can be T x 2

Homework: compare this with A* / Beam Search. Reason for this comparison: both of them work for finding and recovering a sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 109
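The three steps above can be made runnable in a few lines. The sketch below applies them to the two-state probabilistic FSM of the tree/table slides; the arc probabilities are the combined per-arc values as read off those slides, so treat the numbers as illustrative.

fsm_states = ['S1', 'S2']
arc = {   # arc[(Si, a, Sj)] = P(Si --a--> Sj)
    ('S1', 'a1', 'S1'): 0.1, ('S1', 'a1', 'S2'): 0.3,
    ('S2', 'a1', 'S1'): 0.2, ('S2', 'a1', 'S2'): 0.3,
    ('S1', 'a2', 'S1'): 0.2, ('S1', 'a2', 'S2'): 0.4,
    ('S2', 'a2', 'S1'): 0.3, ('S2', 'a2', 'S2'): 0.2,
}

def viterbi(output, start='S1'):
    # Initialization: probability 1.0 for the start state, 0.0 elsewhere.
    seqscore = [{s: (1.0 if s == start else 0.0) for s in fsm_states}]
    backptr = []
    # Iteration: for each state keep only the best-scoring way of reaching it.
    for a in output:
        scores, ptrs = {}, {}
        for sj in fsm_states:
            best_prev = max(fsm_states,
                            key=lambda si: seqscore[-1][si] * arc.get((si, a, sj), 0.0))
            scores[sj] = seqscore[-1][best_prev] * arc.get((best_prev, a, sj), 0.0)
            ptrs[sj] = best_prev
        seqscore.append(scores)
        backptr.append(ptrs)
    # Sequence identification: pick the best final state and follow the back pointers.
    last = max(fsm_states, key=lambda s: seqscore[-1][s])
    path = [last]
    for ptrs in reversed(backptr):
        path.append(ptrs[path[-1]])
    return list(reversed(path)), seqscore[-1][last]

path, prob = viterbi(['a1', 'a2', 'a1', 'a2'])
print(path, prob)   # ['S1', 'S2', 'S1', 'S2', 'S1'], 0.0081 -- the path recovered above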

Viterbi Algorithm for the Urn problem (first two symbols)

(Figure: Viterbi search tree for the urn HMM over the first two symbols, ε and R — from S0 to U1, U2, U3, each then expanding again to U1, U2, U3; at each level only the best-scoring path into each state is retained.)

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory works:
P(S) P(O|S) = P(O0|S0) P(S1|S0)
              [P(O1|S1) P(S2|S1,S0)] [P(O2|S2) P(S3|S2,S1)] [P(O3|S3) P(S4|S3,S2)]
              [P(O4|S4) P(S5|S4,S3)] [P(O5|S5) P(S6|S5,S4)] [P(O6|S6) P(S7|S6,S5)]
              [P(O7|S7) P(S8|S7,S6)] [P(O8|S8) P(S9|S8,S7)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e. P(S9|S8,S7) = 1. O0 is the ε-transition.

Obs:    ε  R  R  G  G  B  R  G  R
State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence? The language modeling problem:

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   • less probable because of grammatical mistake
3. P("The svn rises in the east")
   • less probable because of lexical mistake
4. P("The sun rises in the west")
   • less probable because of semantic mistake

Probabilities are computed in the context of corpora.

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of a language model:
1. Detect well-formedness
   • lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   • Given a piece of text, what language does it belong to?
     Good morning – English; Guten Morgen – German; Bon jour – French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Pronunciation Dictionary

P(SS|W)= P(t o m ae t o |Word is ldquotomatordquo) = Product of arc probabilities

t

s4

o m o

ae

t

aa

end

s1 s2 s3s5

s6 s7

10 10 10 1010

10

073

027

Word

Pronunciation Automaton

Tomato

21 July 2014Pushpak Bhattacharyya Intro

POS 44

Foundational question

Generative vs Discrimnative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

bull Entity A and Entity BndashMatch(AB)

ndashTwo entities match iff their parts matchbull Match(Parts(A) Parts(B))

ndashTwo entities match iff their properties matchbull Match(Properties(A) Properties(B))

bull Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books Journals Proceedings Main Text(s)

Natural Language Understanding James Allan Speech and NLP Jurafsky and Martin Foundations of Statistical NLP Manning and Schutze

Other References Statistical NLP Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied DisciplinesPhilosophy Semantics Meaning of ldquomeaningrdquo Logic

(syllogism)Linguistics Study of Syntax Lexicon Lexical Semantics etc

Probability and Statistics Corpus Linguistics Testing of Hypotheses System Evaluation

Cognitive Science Computational Models of Language Processing Language Acquisition

Psychology Behavioristic insights into Language Processing Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory Entropy Random Fields

Computer Sc amp Engg Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

infection declension derivation etc ) Rule based Vs Statistical NLU comparison with POS tagging as case study Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

ldquo_ldquo The_DT mechanisms_NNS that_WDTmake_VBP traditional_JJ hardware_NNare_VBP really_RB being_VBGobsoleted_VBN by_IN microprocessor-based_JJ machines_NNS _ rdquo_rdquo said_VBDMr_NNP Benton_NNP _

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Foundational question

Generative vs Discrimnative

21 July 2014Pushpak Bhattacharyya Intro

POS 45

How are two entities matched

bull Entity A and Entity BndashMatch(AB)

ndashTwo entities match iff their parts matchbull Match(Parts(A) Parts(B))

ndashTwo entities match iff their properties matchbull Match(Properties(A) Properties(B))

bull Heart of discriminative vs generative scoring

21 July 2014Pushpak Bhattacharyya Intro

POS 46

Books, Journals, Proceedings
Main texts: Natural Language Understanding, James Allen; Speech and Language Processing, Jurafsky and Martin; Foundations of Statistical NLP, Manning and Schutze
Other references: Statistical NLP, Charniak
Journals: Computational Linguistics, Natural Language Engineering, AI, AI Magazine, IEEE SMC
Conferences: ACL, EACL, COLING, MT Summit, EMNLP, IJCNLP, HLT, ICON, SIGIR, WWW, ICML, ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied Disciplines
Philosophy: Semantics, meaning of "meaning", logic (syllogism)
Linguistics: Study of syntax, lexicon, lexical semantics, etc.
Probability and Statistics: Corpus linguistics, testing of hypotheses, system evaluation
Cognitive Science: Computational models of language processing, language acquisition
Psychology: Behaviouristic insights into language processing, psychological models
Brain Science: Language processing areas in the brain
Physics: Information theory, entropy, random fields
Computer Sc. & Engg.: Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day 1: Introduction; NLP as a playground for rule based and statistical techniques
Before break: Complete NLP architecture, ambiguity, start of POS tagging
After break: NLTK (open source Python based framework of comprehensive NLP tools), POS tagging assignment
Day 2: Shallow parsing
Before break: Morph analysis and synthesis (segmentation, inflection, declension, derivation, etc.); rule based vs statistical NLU comparison with POS tagging as case study; Hidden Markov Model and Viterbi algorithm
After break: POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day 3: Syntactic parsing
Before break: Parsing – classical and statistical theory and techniques
After break: Hands-on with a probabilistic parser
Day 4: Semantics
Before break: Rule based NLU; case study of semantic graph generation through Universal Networking Language (UNL)
After break: Continue POS tagging and parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day 5: Lexical resources
Before break: WordNet, ConceptNet, FrameNet, VerbNet, etc.
After break: Hands-on with lexical resources; NELL, NEIL
Day 6: Information extraction, text classification and basic search
Before break: Named entity recognition, text entailment, Lucene, Nutch, etc.
After break: NER hands-on, basic search, an Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day 7: Affective NLP (cognitive and culture specific NLP)
Before break: Sentiment analysis, pragmatics, intent recognition (sarcasm, thwarting), eye-tracking
After break: Machine learning techniques with sentiment analysis as target
Day 8: Deep learning
Before break: Word vectors and embeddings, neural nets, neural language models
After break: Discussion on a deep learning tool
Days 9 and 10: Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summary
• Both linguistics and computation are needed
• Linguistics is the eye, computation the body
• The Phenomenon–Formalization–Technique–Experimentation–Evaluation–Hypothesis Testing loop has accorded to NLP the prestige it commands today
• A natural-science-like approach
• Neither theory building nor data-driven pattern finding can be ignored
21 July 2014 Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

"_" The_DT mechanisms_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, "_" said_VBD Mr._NNP Benton_NNP ._.

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increased complexity of processing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_J batsman_N Aravinda_De_Silva_N is_F a_F man_N of_F few_J words_N was_F very_R much_R evident_J on_F Wednesday_N when_F the_F legendary_J batsman_N ,_F who_F has_V always_R let_V his_N bat_N talk_V ,_F struggled_V to_F answer_V a_F barrage_N of_F questions_N at_F a_F function_N to_F promote_V the_F cricket_N league_N in_F the_F city_N ._F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation: That_F/N/J ('that' can be a complementizer (under 'F'), a demonstrative (under 'J') or a pronoun (under 'N')); former_J; Sri_N/J Lanka_N/J ('Sri Lanka' together qualifies the skipper); skipper_N/V ('skipper' can be a verb too); and_F; ace_J/N ('ace' can be both J and N: "Nadal served an ace"); batsman_N/J ('batsman' can be J as it qualifies Aravinda De Silva); Aravinda_N De_N Silva_N; is_F; a_F; man_N/V ('man' can be a verb too, as in 'man the boat'); of_F; few_J; words_N/V ('words' can be a verb too, as in 'he words his speeches beautifully')

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of "that":
That man is known by the company he keeps. (Demonstrative)
Man that is known by the company he keeps gets a good job. (Pronoun)
That man is known by the company he keeps is a proverb. (Complementizer)
Chaotic systems: systems where a small perturbation in the input causes a large change in the output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation (contd.): was_F very_R much_R evident_J on_F Wednesday_N when_F/N ('when' can be a relative pronoun (under 'N'), as in 'I know the time when he comes'); the_F legendary_J batsman_N; who_F/N; has_V always_R let_V his_N bat_N/V talk_V/N; struggle_V/N; answer_V/N; barrage_N/V; question_N/V; function_N/V; promote_V; cricket_N; league_N; city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (1/2)
Best tag sequence = T* = argmax P(T|W) = argmax P(T) P(W|T)    (by Bayes' theorem)
P(T) = P(t0=^, t1, t2, …, tn+1=$)
     = P(t0) P(t1|t0) P(t2|t1,t0) P(t3|t2,t1,t0) … P(tn|tn-1,…,t0) P(tn+1|tn,…,t0)
     = P(t0) P(t1|t0) P(t2|t1) … P(tn|tn-1) P(tn+1|tn)
     = Π_{i=0..n+1} P(ti|ti-1)    (bigram assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1) P(w1|w0,t0-tn+1) P(w2|w1,w0,t0-tn+1) … P(wn|w0-wn-1,t0-tn+1) P(wn+1|w0-wn,t0-tn+1)
Assumption: a word is determined completely by its tag. This is inspired by speech recognition.
       = P(w0|t0) P(w1|t1) … P(wn+1|tn+1)
       = Π_{i=0..n+1} P(wi|ti)
       = Π_{i=1..n+1} P(wi|ti)    (lexical probability assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 66
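Under these two assumptions, scoring a candidate tag sequence reduces to a product of bigram and lexical factors. The following minimal Python sketch (written for these notes, not from the slides) scores a tagged sentence using the transition and lexical probability tables that appear a few slides later for the '^ people laugh $' example; the function name and dictionary encoding are illustrative choices.

# P(t_i | t_i-1): bigram (transition) probabilities, from the table given later.
trans = {'^': {'N': 0.6, 'V': 0.2, 'O': 0.2},
         'N': {'N': 0.1, 'V': 0.4, 'O': 0.3, '$': 0.2},
         'V': {'N': 0.3, 'V': 0.1, 'O': 0.3, '$': 0.3},
         'O': {'N': 0.3, 'V': 0.2, 'O': 0.3, '$': 0.2}}
# P(w_i | t_i): lexical probabilities, from the table given later.
lex = {'N': {'people': 1e-3, 'laugh': 1e-5},
       'V': {'people': 1e-6, 'laugh': 1e-3},
       'O': {'laugh': 1e-9}}

def score(tags, words):
    """P(T) * P(W|T) for a tag and word sequence that both start with ^
    and end with $ (the boundary symbols emit with probability 1)."""
    p = 1.0
    for i in range(1, len(tags)):
        p *= trans[tags[i - 1]].get(tags[i], 0.0)     # bigram factor P(t_i | t_i-1)
        if tags[i] not in ('^', '$'):
            p *= lex[tags[i]].get(words[i], 0.0)      # lexical factor P(w_i | t_i)
    return p

print(score(['^', 'N', 'N', '$'], ['^', 'people', 'laugh', '$']))   # 1.2e-10

The argmax of this score over all tag sequences is exactly what the Viterbi algorithm, introduced later, computes without enumerating every sequence.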

Generative Model

^_^ People_N Jump_V High_R ._.
[Diagram: candidate tag states for each word (N/V for People, V/N for Jump, A/N for High), linked from ^; tag-to-tag arcs carry bigram probabilities, tag-to-word arcs carry lexical probabilities]
Lexical probabilities
Bigram probabilities
This model is called a generative model. Here words are observed from tags as states. This is similar to an HMM.

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi – unigram, bigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68
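Per-POS accuracy and the confusion matrix in the steps above take only a few lines of Python. This is a generic evaluation sketch (not the script behind the BNC numbers on the next slides); gold and pred are assumed to be aligned lists of tags, one per token.

from collections import Counter, defaultdict

def evaluate(gold, pred):
    """Per-POS accuracy, overall accuracy and a confusion matrix
    from two aligned lists of tags."""
    confusion = defaultdict(Counter)          # confusion[gold_tag][predicted_tag]
    correct, total = Counter(), Counter()
    for g, p in zip(gold, pred):
        confusion[g][p] += 1
        total[g] += 1
        correct[g] += (g == p)
    per_pos = {t: correct[t] / total[t] for t in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_pos, overall, confusion

per_pos, overall, confusion = evaluate(['N', 'V', 'N', 'J'], ['N', 'N', 'N', 'J'])
print(per_pos, overall, dict(confusion))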

[Bar chart: per-POS accuracy under the bigram assumption, one bar per BNC tag (AJ0, AJ0-NN1, AJ0-VVG, AJC, AT0, AV0-AJ0, AVP-PRP, AVQ, CJS, CJS-PRP, CJT-DT0, CRD-PNI, DT0, DTQ, ITJ, NN1, NN1-NP0, NN1-VVG, NN2-VVZ, NP0-NN1, PNI, PNP, PNX, PRP, PRP-CJS, TO0, VBB, VBG, VBN, VDB, VDG, VDN, VHB, VHG, VHN, VM0, VVB-NN1, VVD-AJ0, VVG, VVG-NN1, VVN, VVN-VVD, VVZ-NN2); y-axis from 0 to 1.2]

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

           AJ0  AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN  AJC  AJS   AT0   AV0  AV0-AJ0  AVP
AJ0       2899     20       32        1        3        3     0    0     18    35      27     1
AJ0-AV0     31     18        2        0        0        0     0    0      0     1      15     0
AJ0-NN1    161      0      116        0        0        0     0    0      0     0       1     0
AJ0-VVD      7      0        0        0        0        0     0    0      0     0       0     0
AJ0-VVG      8      0        0        0        2        0     0    0      1     0       0     0
AJ0-VVN      8      0        0        3        0        2     0    0      1     0       0     0
AJC          2      0        0        0        0        0    69    0      0    11       0     0
AJS          6      0        0        0        0        0     0   38      0     2       0     0
AT0        192      0        0        0        0        0     0    0   7000    13       0     0
AV0        120      8        2        0        0        0    15    2     24  2444      29    11
AV0-AJ0     10      7        0        0        0        0     0    0      0    16      33     0
AVP         24      0        0        0        0        0     0    0      1    11       0   737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1: # of Red = 30%, # of Green = 50%, # of Blue = 20%
Urn 2: # of Red = 10%, # of Green = 40%, # of Blue = 50%
Urn 3: # of Red = 60%, # of Green = 10%, # of Blue = 30%

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

Given:
Transition probability table:
        U1    U2    U3
U1     0.1   0.4   0.5
U2     0.6   0.2   0.2
U3     0.3   0.4   0.3
Emission probability table:
        R     G     B
U1     0.3   0.5   0.2
U2     0.1   0.4   0.5
U3     0.6   0.1   0.3
Observation: R R G G B R G R
State sequence? Not so easily computable.

21 July 2014Pushpak Bhattacharyya Intro

POS 73
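The two tables above are all the parameters of the urn HMM except the initial distribution, which this slide does not give; the sketch below assumes a uniform start purely for illustration. It evaluates P(O, Q) for one candidate state sequence, the quantity that brute force would have to maximise over all 3^8 sequences.

# A[u][v] = P(next urn v | current urn u);  B[u][c] = P(colour c | urn u).
A = {'U1': {'U1': 0.1, 'U2': 0.4, 'U3': 0.5},
     'U2': {'U1': 0.6, 'U2': 0.2, 'U3': 0.2},
     'U3': {'U1': 0.3, 'U2': 0.4, 'U3': 0.3}}
B = {'U1': {'R': 0.3, 'G': 0.5, 'B': 0.2},
     'U2': {'R': 0.1, 'G': 0.4, 'B': 0.5},
     'U3': {'R': 0.6, 'G': 0.1, 'B': 0.3}}
pi = {'U1': 1 / 3, 'U2': 1 / 3, 'U3': 1 / 3}   # assumed uniform start, not from the slide

def joint(states, obs):
    """P(O, Q) for one state sequence Q and observation sequence O."""
    p = pi[states[0]] * B[states[0]][obs[0]]
    for prev, cur, o in zip(states, states[1:], obs[1:]):
        p *= A[prev][cur] * B[cur][o]
    return p

print(joint(['U1', 'U3', 'U3', 'U2'], ['R', 'R', 'G', 'G']))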


Diagrammatic representation (12)

[State-transition diagram: nodes U1, U2, U3 with the transition probabilities 0.1, 0.4, 0.5, 0.6, 0.2, 0.2, 0.3, 0.4, 0.3 on the arcs, and emission probabilities attached to each urn: U1 (R 0.3, G 0.5, B 0.2), U2 (R 0.1, G 0.4, B 0.5), U3 (R 0.6, G 0.1, B 0.3)]

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

[Diagram: the same three-urn network, with each arc now labelled by transition probability × emission probability for each colour (values such as R 0.02 / G 0.08 / B 0.10, R 0.24 / G 0.04 / B 0.12, R 0.06 / G 0.24 / B 0.30, R 0.15 / G 0.25 / B 0.10, R 0.18 / G 0.03 / B 0.09, R 0.03 / G 0.05 / B 0.02)]

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMMs:
1. Given the observation sequence, find the possible state sequence(s) – Viterbi algorithm
2. Given the observation sequence, find its probability – forward/backward algorithm
3. Given the observation sequence, find the HMM parameters – Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi
The "start" and "end" markers are important in a sequence.
Subtrees get eliminated due to the Markov assumption.
POS tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), $ (end) [start & end states]
Lexicon: people: N, V; laugh: N, V
Corpora for training:
^ w11_t11 w12_t12 w13_t13 …… w1k_1_t1k_1 $
^ w21_t21 w22_t22 w23_t23 …… w2k_2_t2k_2 $
^ wn1_tn1 wn2_tn2 wn3_tn3 …… wnk_n_tnk_n $

Inference

[Partial sequence graph: from ^, candidate tags N and V for 'people', each expanding to N and V for 'laugh']
Transition probability table:
        ^     N     V     O     $
^       0    0.6   0.2   0.2    0
N       0    0.1   0.4   0.3   0.2
V       0    0.3   0.1   0.3   0.3
O       0    0.3   0.2   0.3   0.2
$       1     0     0     0     0
This transition table will change from language to language due to language divergences.
Lexical probability table (size = #POS tags in the tagset × vocabulary size, where vocabulary size = #unique words in the corpus):
        ε       people     laugh      …
^       1         0          0        0
N       0      1×10^-3    1×10^-5
V       0      1×10^-6    1×10^-3
O       0         0        1×10^-9
$       1         0          0        0

Inference for a new sentence:
^ people laugh $
p(^ N N $ | ^ people laugh $) = [P(N|^) × P(people|N)] × [P(N|N) × P(laugh|N)] × P($|N)
                              = (0.6 × 1×10^-3) × (0.1 × 1×10^-5) × 0.2
[Partial graph: ^ → {N, V} for 'people' → {N, V} for 'laugh' → $]

Computational Complexity
If we had to compute the probability of every sequence and then find the maximum among them, we would run into an exponential number of computations.
If |s| = #states (tags + ^ + $) and |o| = length of the sentence (#words + ^ + $), then #sequences = |s|^(|o|−2).
But a large number of partial computations can be reused using dynamic programming.

Dynamic Programming
[Trellis for '^ people laugh $': from ^ (on ε) to candidate nodes N1, V2, O3 for 'people'; each of these expands to N, V, O nodes for 'laugh'; nodes N4 and N5 are shown but not expanded]
From ^: 0.6 × 1.0 = 0.6 (to N1), 0.2 (to V2), 0.2 (to O3)
From N1 (emitting 'people' with probability 10^-3):
to N: 0.6 × 0.1 × 10^-3 = 6 × 10^-5
to V: 0.6 × 0.4 × 10^-3 = 2.4 × 10^-4
to O: 0.6 × 0.3 × 10^-3 = 1.8 × 10^-4
to $: 0.6 × 0.2 × 10^-3 = 1.2 × 10^-4
No need to expand N4 and N5, because they will never be part of the winning sequence.

Computational Complexity
Retain only those N, V, O nodes which end in the highest sequence probability.
Now the complexity reduces from |s|^|o| to |s|·|o|.
Here we followed the Markov assumption of order 1.

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state.
Keep advancing only those sequences that are "maximum" amongst all those ending in the same state.

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm
[Tree for the sentence '^ People laugh $': from ^ (on ε), branches to N, V, O with probabilities (0.6), (0.2), (0.2); on 'People', the N branch expands to N, V, O with (0.06×10^-3), (0.24×10^-3), (0.18×10^-3), the V branch with (0.06×10^-6), (0.02×10^-6), (0.06×10^-6), and the O branch with (0), (0), (0)]
Claim: we do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87
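The pruning claim above (never expand a node that has already lost to another node ending in the same state) is the whole of the Viterbi idea. Below is a minimal Python sketch for the '^ people laugh $' example using the transition and lexical tables from the earlier slides; the code layout and function name are mine, the numbers are the deck's.

trans = {'^': {'N': 0.6, 'V': 0.2, 'O': 0.2},
         'N': {'N': 0.1, 'V': 0.4, 'O': 0.3, '$': 0.2},
         'V': {'N': 0.3, 'V': 0.1, 'O': 0.3, '$': 0.3},
         'O': {'N': 0.3, 'V': 0.2, 'O': 0.3, '$': 0.2}}
lex = {'N': {'people': 1e-3, 'laugh': 1e-5},
       'V': {'people': 1e-6, 'laugh': 1e-3},
       'O': {'laugh': 1e-9}}

def viterbi(words, tags=('N', 'V', 'O')):
    """Best tag sequence for the words between ^ and $, order-1 Markov."""
    best = {'^': (1.0, ['^'])}                       # state -> (score, best path)
    for w in words:
        new = {}
        for t in tags:
            cands = [(sc * trans[p].get(t, 0.0) * lex[t].get(w, 0.0), path)
                     for p, (sc, path) in best.items()]
            sc, path = max(cands, key=lambda c: c[0])
            new[t] = (sc, path + [t])
        best = new                                   # losers for each state are pruned here
    sc, path = max(((sc * trans[p].get('$', 0.0), path)
                    for p, (sc, path) in best.items()), key=lambda c: c[0])
    return sc, path + ['$']

print(viterbi(['people', 'laugh']))                  # (7.2e-08, ['^', 'N', 'V', '$'])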

Effect of shifting probability mass
Will a word always be given the same tag? No. Consider:
^ people the city with soldiers $ (i.e., 'populate')
^ quickly people the city $
In the first sentence "people" is most likely to be tagged as a noun, whereas in the second the probability mass will shift and "people" will be tagged as a verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and language phenomenon
Long tail phenomenon: the probability is very low, but not zero, over a large observed sequence.
Language phenomenon: "people", which is predominantly tagged as Noun, displays a long-tail phenomenon; "laugh" is predominantly tagged as Verb.

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)
[Two nodes N1 and N2 for 'LAUGH', carrying probabilities (6×10^-5) and (6×10^-8); each expands to N, V, O]
In the next step all the probabilities will be multiplied by identical factors (lexical and transition), so the children of N2 will always have lower probability than the children of N1.

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean?
P(A|B) = P(B|A) if P(A) = P(B)
P(A|B) may encode: causality (B causes A), or sequentiality (A follows B)

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example
Here S = {U1, U2, U3}, V = {R, G, B}
For observation sequence O = o1 … on and state sequence Q = q1 … qn,
π_i = P(q1 = U_i)
A (transition matrix):
        U1    U2    U3
U1     0.1   0.4   0.5
U2     0.6   0.2   0.2
U3     0.3   0.4   0.3
B (emission matrix):
        R     G     B
U1     0.3   0.5   0.2
U2     0.1   0.4   0.5
U3     0.6   0.1   0.3

92

Observations and states
        O1 O2 O3 O4 O5 O6 O7 O8
OBS:     R  R  G  G  B  R  G  R
State:  S1 S2 S3 S4 S5 S6 S7 S8
Si = U1, U2 or U3 (a particular state); S: state sequence; O: observation sequence
S* = "best" possible state (urn) sequence
Goal: maximize P(S|O) by choosing the "best" S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal
Maximize P(S|O), where S is the state sequence and O is the observation sequence.
S* = argmax_S P(S|O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start
P(S|O) = P(S1–8 | O1–8)
       = P(S1|O) P(S2|S1,O) P(S3|S2,S1,O) … P(S8|S7…S1,O)
By the Markov assumption (a state depends only on the previous state):
P(S|O) ≈ P(S1|O) P(S2|S1,O) P(S3|S2,O) … P(S8|S7,O)
        O1 O2 O3 O4 O5 O6 O7 O8
OBS:     R  R  G  G  B  R  G  R
State:  S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem
P(A|B) = P(A) P(B|A) / P(B)
P(A): prior; P(B|A): likelihood
argmax_S P(S|O) = argmax_S P(S) P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transition Probability
P(S) = P(S1–8)
     = P(S1) P(S2|S1) P(S3|S2,S1) P(S4|S3,S2,S1) … P(S8|S7…S1)
By the Markov assumption (k = 1):
P(S) = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) … P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence Probability
P(O|S) = P(O1|S1–8) P(O2|O1,S1–8) P(O3|O2,O1,S1–8) … P(O8|O1–7,S1–8)
Assumption: the ball drawn depends only on the urn chosen, so
P(O|S) = P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)
Hence
P(S|O) ∝ P(S) P(O|S)
        = [P(S1) P(S2|S1) P(S3|S2) … P(S8|S7)] × [P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)]

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms
P(S) P(O|S) = [P(O0|S0) P(S1|S0)] [P(O1|S1) P(S2|S1)] [P(O2|S2) P(S3|S2)] [P(O3|S3) P(S4|S3)] [P(O4|S4) P(S5|S4)] [P(O5|S5) P(S6|S5)] [P(O6|S6) P(S7|S6)] [P(O7|S7) P(S8|S7)] [P(O8|S8) P(S9|S8)]
We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e., P(S9|S8) = 1.
O0 is the ε-transition.
        O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:     ε  R  R  G  G  B  R  G  R
State:  S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

[Chain diagram: S0 →ε S1 →R S2 →R S3 →G S4 →G S5 →B S6 →R S7 →G S8 →R S9]
        O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:     ε  R  R  G  G  B  R  G  R
State:  S0 S1 S2 S3 S4 S5 S6 S7 S8 S9
Notation: P(Ok|Sk) P(Sk+1|Sk) = P(Sk → Sk+1) on output Ok, the probability attached to the arc from Sk to Sk+1 that emits Ok.

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

[Two-state machine S1, S2, with arcs labelled (output symbol, probability): (a1, 0.3), (a2, 0.4), (a1, 0.2), (a2, 0.3), (a1, 0.1), (a2, 0.2), (a1, 0.3), (a2, 0.2)]
The question here is: "What is the most likely state sequence, given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree
Start: S1 = 1.0, S2 = 0.0 (on ε)
Reading a1: candidate scores 1.0×0.1 = 0.1, 0.3, 0.0, 0.0; winners S1 = 0.1, S2 = 0.3
Reading a2: candidate scores 0.1×0.2 = 0.02, 0.1×0.4 = 0.04, 0.3×0.3 = 0.09, 0.3×0.2 = 0.06
Choose the winning sequence per state, per iteration: S1 = 0.09, S2 = 0.06

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd…
Reading a1 again (from S1 = 0.09, S2 = 0.06):
S1: max(0.09×0.1 = 0.009, 0.06×0.2 = 0.012) = 0.012;  S2: max(0.09×0.3 = 0.027, 0.06×0.3 = 0.018) = 0.027
Reading a2:
S1: max(0.012×0.2 = 0.0024, 0.027×0.3 = 0.0081) = 0.0081;  S2: max(0.012×0.4 = 0.0048, 0.027×0.2 = 0.0054) = 0.0054
The problem being addressed by this tree is S* = argmax_S P(S | a1–a2–a1–a2, μ), where a1–a2–a1–a2 is the output sequence and μ the model (the machine).

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward): S1 → S2 → S1 → S2 → S1 on outputs a1, a2, a1, a2
Problem statement: find the best possible sequence S* = argmax_S P(S|O, μ),
where S = state sequence, O = output sequence, μ = machine (or model).
Machine (or model) μ = (S, A, T, S0), with S0 = start symbol, S = state collection, A = alphabet set, T = transitions.
T is defined by P(Si –ak→ Sj), the probability of moving from Si to Sj while outputting ak.

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree
Latest symbol observed →   ε      a1                            a2             a1               a2
Ending state S1:          1.0    (1.0×0.1, 0.0×0.2) = (0.1, 0.0)   (0.02, 0.09)   (0.009, 0.012)   (0.0024, 0.0081)
Ending state S2:          0.0    (1.0×0.3, 0.0×0.3) = (0.3, 0.0)   (0.04, 0.06)   (0.027, 0.018)   (0.0048, 0.0054)
Note: every cell records the winning probability ending in that state; the two entries of each tuple are the scores of arriving via S1 and via S2 respectively.
Final winner: the bold-faced value in each cell is the winning sequence probability ending in that state. The final winner is 0.0081, ending in S1; since it was reached via S2 (the 2nd element of the tuple), we go backward from S2 and recover the whole sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm (following James Allen, Natural Language Understanding (2nd edition), Benjamin/Cummings (pub.), 1995)
Given:
1. The HMM, which means:
   a. Start state S1
   b. Alphabet A = {a1, a2, …, ap}
   c. Set of states S = {S1, S2, …, Sn}
   d. Transition probability P(Si –ak→ Sj), which is equal to P(Sj | Si, ak)
2. The output string a1 a2 … aT
To find: the most likely sequence of states C1 C2 … CT which produces the given output sequence, i.e.,
C1 C2 … CT = argmax_C [P(C | a1, a2, …, aT)]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contd…
Data structures:
1. An N×T array called SEQSCORE, to maintain the winner sequence always (N = #states, T = length of output sequence)
2. Another N×T array called BACKPTR, to recover the path
Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization
SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For i = 2 to N do: SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]
2. Iteration
For t = 2 to T do
  For i = 1 to N do
    SEQSCORE(i,t) = max_{j=1..N} [SEQSCORE(j, t−1) × P(Sj –ak→ Si)]
    BACKPTR(i,t) = the index j that gives the max above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification
C(T) = the i that maximizes SEQSCORE(i,T)
For i from (T−1) down to 1 do: C(i) = BACKPTR[C(i+1), (i+1)]
Optimizations possible:
1. BACKPTR can be 1×T
2. SEQSCORE can be T×2
Homework: compare this with A* / beam search. (Reason for this comparison: both of them work by finding and recovering a sequence.)

21 July 2014Pushpak Bhattacharyya Intro

POS 109
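A direct Python transcription of the SEQSCORE/BACKPTR procedure above is sketched below. It is exercised on the two-state a1/a2 machine of the preceding slides, with the per-arc probabilities read off the tree-development numbers (S1 on a1: 0.1 to S1, 0.3 to S2; S1 on a2: 0.2 / 0.4; S2 on a1: 0.2 / 0.3; S2 on a2: 0.3 / 0.2); this is my reading of that diagram, and with it the code reproduces the winning score 0.0081 and the path S1-S2-S1-S2-S1.

# P[(Si, a, Sj)] = P(Si --a--> Sj)
P = {('S1', 'a1', 'S1'): 0.1, ('S1', 'a1', 'S2'): 0.3,
     ('S1', 'a2', 'S1'): 0.2, ('S1', 'a2', 'S2'): 0.4,
     ('S2', 'a1', 'S1'): 0.2, ('S2', 'a1', 'S2'): 0.3,
     ('S2', 'a2', 'S1'): 0.3, ('S2', 'a2', 'S2'): 0.2}
states, output = ['S1', 'S2'], ['a1', 'a2', 'a1', 'a2']
T = len(output)

# 1. Initialization: the machine starts in S1.
SEQSCORE = {(s, 0): 1.0 if s == 'S1' else 0.0 for s in states}
BACKPTR = {}

# 2. Iteration.
for t in range(1, T + 1):
    a = output[t - 1]
    for i in states:
        score, j = max((SEQSCORE[(p, t - 1)] * P[(p, a, i)], p) for p in states)
        SEQSCORE[(i, t)], BACKPTR[(i, t)] = score, j

# 3. Sequence identification: walk the back-pointers from the best final state.
seq = [max(states, key=lambda s: SEQSCORE[(s, T)])]
for t in range(T, 0, -1):
    seq.append(BACKPTR[(seq[-1], t)])
seq.reverse()
print(SEQSCORE[(seq[-1], T)], seq)     # 0.0081 ['S1', 'S2', 'S1', 'S2', 'S1']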

Viterbi Algorithm for the Urn problem (first two symbols)

[Tree diagram: the start state S0 branches on ε to U1, U2, U3 with arc probabilities 0.5, 0.3, 0.2; after reading the first R the nodes carry scores 0.03, 0.08, 0.15; after the second R the expanded U1, U2, U3 nodes carry arc values 0.06, 0.02, 0.02, 0.18, 0.24, 0.18 and accumulated scores 0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036]

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of order > 1 (say 2)
The same theory works:
P(S) P(O|S) = P(O0|S0) P(S1|S0) [P(O1|S1) P(S2|S1,S0)] [P(O2|S2) P(S3|S2,S1)] [P(O3|S3) P(S4|S3,S2)] [P(O4|S4) P(S5|S4,S3)] [P(O5|S5) P(S6|S5,S4)] [P(O6|S6) P(S7|S6,S5)] [P(O7|S7) P(S8|S7,S6)] [P(O8|S8) P(S9|S8,S7)]
We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e., P(S9|S8,S7) = 1.
O0 is the ε-transition.
        O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:     ε  R  R  G  G  B  R  G  R
State:  S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of the observation sequence? The language modelling problem.
1. P("The sun rises in the east")
2. P("The sun rise in the east")
   • Less probable, because of a grammatical mistake
3. P("The svn rises in the east")
   • Less probable, because of a lexical mistake
4. P("The sun rises in the west")
   • Less probable, because of a semantic mistake
Probabilities are computed in the context of corpora.

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of a language model:
1. Detect well-formedness
   • Lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   • Given a piece of text, what language does it belong to?
     Good morning – English; Guten Morgen – German; Bon jour – French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0 o1 o2 o3 … om)?
P(O) = Σ_S P(O, S)    (marginalization)
Consider the observation sequence O = o0 o1 o2 … om and the corresponding state sequences S = S0 S1 S2 S3 … Sm Sm+1, where the Si represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0 o1 o2 o3 … om)
P(O, S) = P(S) P(O|S)
P(S0 S1 … Sm+1) P(o0 o1 … om | S0 S1 … Sm+1)
  = [P(S0) P(S1|S0) P(S2|S1) … P(Sm+1|Sm)] × [P(o0|S0) P(o1|S1) … P(om|Sm)]
  = P(S0) [P(o0|S0) P(S1|S0)] [P(o1|S1) P(S2|S1)] … [P(om|Sm) P(Sm+1|Sm)]
P(O) = Σ_S P(O, S)

21 July 2014Pushpak Bhattacharyya Intro

POS 116
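Marginalisation can be done literally, by enumerating every state sequence; this is exactly the exponential computation that the forward algorithm of the next slides avoids. A brute-force sketch for the urn HMM follows (uniform start assumed, as before; that choice is not in the slides).

from itertools import product

A = {'U1': {'U1': 0.1, 'U2': 0.4, 'U3': 0.5},
     'U2': {'U1': 0.6, 'U2': 0.2, 'U3': 0.2},
     'U3': {'U1': 0.3, 'U2': 0.4, 'U3': 0.3}}
B = {'U1': {'R': 0.3, 'G': 0.5, 'B': 0.2},
     'U2': {'R': 0.1, 'G': 0.4, 'B': 0.5},
     'U3': {'R': 0.6, 'G': 0.1, 'B': 0.3}}
pi = {'U1': 1 / 3, 'U2': 1 / 3, 'U3': 1 / 3}   # assumed uniform start

def prob_obs_bruteforce(obs):
    """P(O) = sum over all state sequences S of P(O, S)."""
    total = 0.0
    for seq in product(A, repeat=len(obs)):     # |S|^|O| sequences
        p = pi[seq[0]] * B[seq[0]][obs[0]]
        for prev, cur, o in zip(seq, seq[1:], obs[1:]):
            p *= A[prev][cur] * B[cur][o]
        total += p
    return total

print(prob_obs_bruteforce(['R', 'R', 'G', 'G', 'B', 'R', 'G', 'R']))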

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k, i)
Define F(k, i) = probability of being in state Si having seen o0 o1 o2 … ok
F(k, i) = P(o0 o1 o2 … ok, Si)
With m as the length of the observed sequence and N states,
P(observed sequence) = P(o0 o1 o2 … om) = Σ_{p=0..N} P(o0 o1 … om, Sp) = Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd.)
F(k, q) = P(o0 o1 … ok, Sq)
        = P(o0 o1 … ok−1, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 … ok−1, Sp, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 … ok−1, Sp) P(ok, Sq | o0 o1 … ok−1, Sp)
        = Σ_{p=0..N} F(k−1, p) P(ok, Sq | Sp)
        = Σ_{p=0..N} F(k−1, p) P(Sp → Sq) on output ok
        O0 O1 O2 O3 … Ok Ok+1 … Om−1 Om
        S0 S1 S2 S3 … Sp  Sq  … Sm  Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119
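The recurrence F(k, q) = Σ_p F(k−1, p) · P(ok, Sq | Sp) is a small dynamic program. A sketch for the urn HMM follows (same assumed uniform start as before); here the emission factor is attached to the current state, matching the definition F(k, i) = P(o0 … ok, Si) above.

A = {'U1': {'U1': 0.1, 'U2': 0.4, 'U3': 0.5},
     'U2': {'U1': 0.6, 'U2': 0.2, 'U3': 0.2},
     'U3': {'U1': 0.3, 'U2': 0.4, 'U3': 0.3}}
B = {'U1': {'R': 0.3, 'G': 0.5, 'B': 0.2},
     'U2': {'R': 0.1, 'G': 0.4, 'B': 0.5},
     'U3': {'R': 0.6, 'G': 0.1, 'B': 0.3}}
pi = {'U1': 1 / 3, 'U2': 1 / 3, 'U3': 1 / 3}   # assumed uniform start

def forward(obs):
    """F[k][q] = P(o_0 ... o_k, state at time k is q)."""
    F = [{q: pi[q] * B[q][obs[0]] for q in A}]
    for o in obs[1:]:
        prev = F[-1]
        F.append({q: sum(prev[p] * A[p][q] for p in A) * B[q][o] for q in A})
    return F

F = forward(['R', 'R', 'G', 'G', 'B', 'R', 'G', 'R'])
print(sum(F[-1].values()))    # P(O) = sum_p F(m, p); equals the brute-force value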

Backward probability B(k, i)
Define B(k, i) = probability of seeing ok ok+1 ok+2 … om, given that the state was Si
B(k, i) = P(ok ok+1 ok+2 … om | Si)
With m as the length of the whole observed sequence,
P(observed sequence) = P(o0 o1 … om) = P(o0 o1 … om | S0) = B(0, 0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd.)
B(k, p) = P(ok ok+1 … om | Sp)
        = Σ_{q=0..N} P(ok, Sq | Sp) P(ok+1 … om | ok, Sq, Sp)
        = Σ_{q=0..N} P(ok+1 … om | Sq) P(ok, Sq | Sp)
        = Σ_{q=0..N} B(k+1, q) P(Sp → Sq) on output ok
        O0 O1 O2 O3 … Ok Ok+1 … Om−1 Om
        S0 S1 S2 S3 … Sp  Sq  … Sm  Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121
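The backward recurrence can be coded the same way. The urn setting has no designated start state S0, so the check below reweights B(0, ·) by the assumed uniform start distribution; Σ_p π_p · B(0, p) then equals the P(O) obtained from the forward pass. This is a sketch under that assumption, not part of the slides.

A = {'U1': {'U1': 0.1, 'U2': 0.4, 'U3': 0.5},
     'U2': {'U1': 0.6, 'U2': 0.2, 'U3': 0.2},
     'U3': {'U1': 0.3, 'U2': 0.4, 'U3': 0.3}}
B = {'U1': {'R': 0.3, 'G': 0.5, 'B': 0.2},
     'U2': {'R': 0.1, 'G': 0.4, 'B': 0.5},
     'U3': {'R': 0.6, 'G': 0.1, 'B': 0.3}}
pi = {'U1': 1 / 3, 'U2': 1 / 3, 'U3': 1 / 3}   # assumed uniform start

def backward(obs):
    """Bwd[k][p] = P(o_k ... o_m | state at time k is p)."""
    m = len(obs) - 1
    Bwd = [None] * (m + 1)
    Bwd[m] = {p: B[p][obs[m]] for p in A}
    for k in range(m - 1, -1, -1):
        Bwd[k] = {p: B[p][obs[k]] * sum(A[p][q] * Bwd[k + 1][q] for q in A) for p in A}
    return Bwd

obs = ['R', 'R', 'G', 'G', 'B', 'R', 'G', 'R']
Bwd = backward(obs)
print(sum(pi[p] * Bwd[0][p] for p in A))   # same value as the forward pass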

How Forward Probability Works

Goal of forward probability: to find P(O) [the probability of the observation sequence].
E.g., ^ People laugh $
[Trellis: start state ^, a column of N/V nodes for each word, and the end state $]

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables
Transition probabilities:
        ^     N     V     $
^       0    0.7   0.3    0
N       0    0.2   0.6   0.2
V       0    0.6   0.2   0.2
$       1     0     0     0
Lexical probabilities:
        ε    People   Laugh
^       1      0        0
N       0     0.8      0.2
V       0     0.1      0.9
$       1      0        0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over state sequences S of Π_i P(Oi|Si) P(Si → Si+1)

Computation in various paths of the tree (over ε, People, Laugh):
Path 1: ^ N N $;  P(Path1) = (1.0×0.7) × (0.8×0.2) × (0.2×0.2)
Path 2: ^ N V $;  P(Path2) = (1.0×0.7) × (0.8×0.6) × (0.9×0.2)
Path 3: ^ V N $;  P(Path3) = (1.0×0.3) × (0.1×0.6) × (0.2×0.2)
Path 4: ^ V V $;  P(Path4) = (1.0×0.3) × (0.1×0.2) × (0.9×0.2)
[Tree: ^ on ε branches to N and V for 'People'; each branches to N and V for 'Laugh']

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis
F = accumulated F × output probability × transition probability
F(N, People) = 0.7 × 1.0
F(V, People) = 0.3 × 1.0
F(N, Laugh) = F(N, People) × (0.2 × 0.8) + F(V, People) × (0.6 × 0.1)
F(V, Laugh) = F(N, People) × (0.6 × 0.8) + F(V, People) × (0.2 × 0.1)
F($) = F(N, Laugh) × (0.2 × 0.2) + F(V, Laugh) × (0.2 × 0.9)
[Trellis: ^ → {N, V} → {N, V} → $ over ε, People, Laugh]

21 July 2014Pushpak Bhattacharyya Intro

POS 125
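The trellis values above can be checked numerically: with the transition and lexical tables of the previous slide, summing the four explicit path products and running the accumulated-F computation both give P(^ people laugh $) ≈ 0.0668. The small verification sketch below keeps the deck's convention of attaching the emission to the source state of each arc.

trans = {'^': {'N': 0.7, 'V': 0.3},
         'N': {'N': 0.2, 'V': 0.6, '$': 0.2},
         'V': {'N': 0.6, 'V': 0.2, '$': 0.2}}
lex = {'^': {'eps': 1.0},
       'N': {'People': 0.8, 'Laugh': 0.2},
       'V': {'People': 0.1, 'Laugh': 0.9}}
obs = ['eps', 'People', 'Laugh']

# Sum over the four explicit paths of the tree.
total = 0.0
for path in (['N', 'N'], ['N', 'V'], ['V', 'N'], ['V', 'V']):
    seq, p = ['^'] + path + ['$'], 1.0
    for k in range(3):
        p *= lex[seq[k]][obs[k]] * trans[seq[k]][seq[k + 1]]   # emission x transition
    total += p

# Accumulate F over the trellis, exactly as on this slide.
F1 = {t: 1.0 * trans['^'][t] for t in ('N', 'V')}
F2 = {t: sum(F1[p] * lex[p]['People'] * trans[p][t] for p in F1) for t in ('N', 'V')}
Fend = sum(F2[p] * lex[p]['Laugh'] * trans[p]['$'] for p in F2)
print(total, Fend)    # both are 0.06676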

Number of Multiplications
Tree: each path has 5 multiplications; there are 4 paths in the tree, so a total of 20 multiplications and 3 additions (to sum the path probabilities).
Trellis:
F(N, People) → 1 multiplication; F(V, People) → 1 multiplication
F(N, Laugh) = F(N, People) × (…) + F(V, People) × (…) → 4 multiplications + 1 addition
Similarly for F(V, Laugh) and F($): 4 multiplications and 1 addition each
So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity
Let |S| = #states and |O| = observation length (excluding ^ and $).
Stage 1 of the trellis: |S| multiplications.
Stage 2 of the trellis: |S| nodes, each needing computation over |S| arcs; each arc = 1 multiplication, and the accumulated F = 1 more multiplication; total 2|S|^2 multiplications.
The same holds for each stage before reading '$'. At the final stage ('$'): 2|S| multiplications.
Therefore, total multiplications = |S| + 2|S|^2 (|O| − 1) + 2|S|.

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm
1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values multiplied by P(Si → Sj).
3. Complexity = |S| + 2|S|^2 (|O| − 1) + 2|S| = 2|S|^2 |O| − 2|S|^2 + 3|S| = O(|S|^2 |O|),
i.e., linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward probability:
   a) Derive the backward algorithm.
   b) Compute its complexity.
2. Express P(O) in terms of both forward and backward probabilities.

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble: auto-completion of words (human vs machine)
Humour detection using WordNet (incongruity theory)
Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131


Books Journals Proceedings Main Text(s)

Natural Language Understanding James Allan Speech and NLP Jurafsky and Martin Foundations of Statistical NLP Manning and Schutze

Other References Statistical NLP Charniak

Journals Computational Linguistics Natural Language Engineering AI AI

Magazine IEEE SMC Conferences

ACL EACL COLING MT Summit EMNLP IJCNLP HLT ICON SIGIR WWW ICML ECML

21 July 2014Pushpak Bhattacharyya Intro

POS 47

Allied DisciplinesPhilosophy Semantics Meaning of ldquomeaningrdquo Logic

(syllogism)Linguistics Study of Syntax Lexicon Lexical Semantics etc

Probability and Statistics Corpus Linguistics Testing of Hypotheses System Evaluation

Cognitive Science Computational Models of Language Processing Language Acquisition

Psychology Behavioristic insights into Language Processing Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory Entropy Random Fields

Computer Sc amp Engg Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

infection declension derivation etc ) Rule based Vs Statistical NLU comparison with POS tagging as case study Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

"_" The_DT mechanisms_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, "_" said_VBD Mr._NNP Benton_NNP ._.

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_J batsman_N Aravinda_De_Silva_N is_F a_F man_N of_F few_J words_N was_F very_R much_R evident_J on_F Wednesday_N when_F the_F legendary_J batsman_N ,_F who_F has_V always_R let_V his_N bat_N talk_V ,_F struggled_V to_F answer_V a_F barrage_N of_F questions_N at_F a_F function_N to_F promote_V the_F cricket_N league_N in_F the_F city_N ._F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation: That_FNJ ('that' can be a complementizer (can be put under 'F'),
demonstrative (can be put under 'J') or pronoun (can be put under 'N')); former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper); skipper_NV ('skipper' can be a verb too); and_F ace_JN ('ace' can be both J and N: "Nadal served an ace"); batsman_NJ ('batsman' can be J as it qualifies Aravinda De Silva); Aravinda_N De_N Silva_N is_F a_F man_NV ('man' can be a verb too, as in 'man the boat'); of_F few_J words_NV ('words' can be a verb too, as in 'he words his speeches
beautifully')

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of "That"

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation: was_F very_R much_R evident_J on_F Wednesday_N when_FN ('when' can be a relative pronoun (put under 'N') as in 'I
know the time when he comes'); the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_VN answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (1/2)
Best tag sequence = T* = argmax_T P(T|W) = argmax_T P(T)P(W|T)   (by Bayes' Theorem)
P(T) = P(t_0=^  t_1 t_2 … t_{n+1}=$)
     = P(t_0) P(t_1|t_0) P(t_2|t_1,t_0) P(t_3|t_2,t_1,t_0) … P(t_n|t_{n-1},t_{n-2},…,t_0) P(t_{n+1}|t_n,t_{n-1},…,t_0)
     = P(t_0) P(t_1|t_0) P(t_2|t_1) … P(t_n|t_{n-1}) P(t_{n+1}|t_n)
     = ∏_{i=0}^{n+1} P(t_i|t_{i-1})   (Bigram Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (2/2)

P(W|T) = P(w_0|t_0…t_{n+1}) P(w_1|w_0,t_0…t_{n+1}) P(w_2|w_1,w_0,t_0…t_{n+1}) … P(w_n|w_0…w_{n-1},t_0…t_{n+1}) P(w_{n+1}|w_0…w_n,t_0…t_{n+1})

Assumption: a word is determined completely by its tag. This is inspired by speech recognition.

       = P(w_0|t_0) P(w_1|t_1) … P(w_{n+1}|t_{n+1})
       = ∏_{i=0}^{n+1} P(w_i|t_i)
       = ∏_{i=1}^{n+1} P(w_i|t_i)   (Lexical Probability Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 66
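To make the two assumptions concrete, here is a minimal Python sketch that scores one candidate tag sequence; the probability tables are illustrative toy values (not estimates from any corpus), and the function names are hypothetical:

```python
# Sketch: score a candidate tag sequence T for a sentence W under
#   P(T)   = prod_i P(t_i | t_{i-1})   (bigram assumption)
#   P(W|T) = prod_i P(w_i | t_i)       (lexical probability assumption)

trans = {("^", "N"): 0.6, ("N", "V"): 0.4, ("V", "$"): 0.3}   # toy P(t_i | t_{i-1})
lex = {("people", "N"): 1e-3, ("laugh", "V"): 1e-3}           # toy P(w_i | t_i)

def score(words, tags):
    """Return P(T) * P(W|T); tags = ['^', t_1, ..., t_n, '$'], words = [w_1, ..., w_n]."""
    p = 1.0
    for prev, cur in zip(tags, tags[1:]):        # bigram transition factors
        p *= trans.get((prev, cur), 0.0)
    for w, t in zip(words, tags[1:-1]):          # lexical factors
        p *= lex.get((w, t), 0.0)
    return p

print(score(["people", "laugh"], ["^", "N", "V", "$"]))   # 0.6*0.4*0.3 * 1e-3*1e-3
```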

Generative Model

^_^ People_N Jump_V High_R ._.
[Diagram: the tag sequence ^ N V A N … as a chain of states, with bigram (tag-to-tag) probabilities on the transitions and lexical probabilities P(word|tag) emitting each word.]
This model is called a generative model: here words are observed (generated) from tags as states. This is similar to an HMM.

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi – Unigram,

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

[Bar chart: per-POS accuracy under the bigram assumption, one bar per tag (AJ0, AJ0-NN1, AJ0-VVG, AJC, AT0, AV0-AJ0, AVP-PRP, AVQ-CJS, CJS, CJS-PRP, CJT-DT0, CRD-PNI, DT0, DTQ, ITJ, NN1, NN1-NP0, NN1-VVG, NN2-VVZ, NP0-NN1, PNI, PNP, PNX, PRP, PRP-CJS, TO0, VBB, VBG, VBN, VDB, VDG, VDN, VHB, VHG, VHN, VM0, VVB-NN1, VVD-AJ0, VVG, VVG-NN1, VVN, VVN-VVD, VVZ-NN2); y-axis from 0 to 1.2.]

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

(rows: one tag; columns: tags it is confused with)
          AJ0  AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN  AJC  AJS   AT0   AV0  AV0-AJ0  AVP
AJ0      2899       20       32        1        3        3    0    0    18    35       27    1
AJ0-AV0    31       18        2        0        0        0    0    0     0     1       15    0
AJ0-NN1   161        0      116        0        0        0    0    0     0     0        1    0
AJ0-VVD     7        0        0        0        0        0    0    0     0     0        0    0
AJ0-VVG     8        0        0        0        2        0    0    0     1     0        0    0
AJ0-VVN     8        0        0        3        0        2    0    0     1     0        0    0
AJC         2        0        0        0        0        0   69    0     0    11        0    0
AJS         6        0        0        0        0        0    0   38     0     2        0    0
AT0       192        0        0        0        0        0    0    0  7000    13        0    0
AV0       120        8        2        0        0        0   15    2    24  2444       29   11
AV0-AJ0    10        7        0        0        0        0    0    0     0    16       33    0
AVP        24        0        0        0        0        0    0    0     1    11        0  737

21 July 2014Pushpak Bhattacharyya Intro

POS 70
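The per-POS accuracy chart and the confusion matrix above can be produced with a few lines of plain Python; this is only a sketch over hypothetical gold/predicted tag lists, not the actual evaluation script behind the screenshots:

```python
from collections import Counter, defaultdict

def evaluate(gold_tags, pred_tags):
    """Per-POS accuracy and confusion counts from aligned gold/predicted tag lists."""
    confusion = defaultdict(Counter)          # confusion[gold][predicted] = count
    for g, p in zip(gold_tags, pred_tags):
        confusion[g][p] += 1
    per_pos_acc = {g: row[g] / sum(row.values()) for g, row in confusion.items()}
    return per_pos_acc, confusion

# Hypothetical example
gold = ["N", "V", "N", "N", "V", "J"]
pred = ["N", "V", "N", "V", "V", "N"]
acc, conf = evaluate(gold, pred)
print(acc)         # {'N': 0.666..., 'V': 1.0, 'J': 0.0}
print(conf["N"])   # Counter({'N': 2, 'V': 1})
```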

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1: # of Red = 30, # of Green = 50, # of Blue = 20
Urn 2: # of Red = 10, # of Green = 40, # of Blue = 50
Urn 3: # of Red = 60, # of Green = 10, # of Blue = 30

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

Transition probability table:
       U1   U2   U3
U1    0.1  0.4  0.5
U2    0.6  0.2  0.2
U3    0.3  0.4  0.3

Emission probability table:
        R    G    B
U1    0.3  0.5  0.2
U2    0.1  0.4  0.5
U3    0.6  0.1  0.3

Given: Observation: R R G G B R G R
To find: State Sequence – not so easily computable.

21 July 2014Pushpak Bhattacharyya Intro

POS 73
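Written as arrays, the same urn HMM looks like this (a small numpy sketch; the slide does not give the initial distribution π, so a uniform start is assumed purely for illustration):

```python
import numpy as np

states = ["U1", "U2", "U3"]
symbols = ["R", "G", "B"]

A = np.array([[0.1, 0.4, 0.5],    # transition: row = current urn, column = next urn
              [0.6, 0.2, 0.2],
              [0.3, 0.4, 0.3]])

B = np.array([[0.3, 0.5, 0.2],    # emission: row = urn, column = R/G/B
              [0.1, 0.4, 0.5],
              [0.6, 0.1, 0.3]])

pi = np.array([1/3, 1/3, 1/3])    # assumed uniform start (not specified on the slide)

obs = [symbols.index(c) for c in "RRGGBRGR"]   # the observation R R G G B R G R
```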

Diagrammatic representation (1/2)
[State-transition diagram: states U1, U2, U3 with the transition probabilities above on the arcs (e.g., U1→U1 = 0.1, U1→U2 = 0.4, U1→U3 = 0.5, …) and the emission probabilities written at each state (U1: R 0.3, G 0.5, B 0.2; U2: R 0.1, G 0.4, B 0.5; U3: R 0.6, G 0.1, B 0.3).]

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (2/2)
[Equivalent diagram with emission and transition probabilities combined on each arc, e.g., R 0.02 / G 0.08 / B 0.10, R 0.24 / G 0.04 / B 0.12, R 0.06 / G 0.24 / B 0.30, R 0.08 / G 0.20 / B 0.12, R 0.15 / G 0.25 / B 0.10, R 0.18 / G 0.03 / B 0.09, R 0.03 / G 0.05 / B 0.02; each value = transition probability × emission probability.]

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM
1. Given the observation sequence, find the possible (best) state sequence – Viterbi algorithm
2. Given the observation sequence, find its probability – forward/backward algorithm
3. Given the observation sequence, find the HMM parameters – Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi: the "start" and "end" are important in a
sequence. Subtrees get eliminated due to the Markov
Assumption.
POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), $ (end) [start & end states]
Illustration of Viterbi – Lexicon:
people: N, V; laugh: N, V
Corpora for Training:
^ w11_t11 w12_t12 w13_t13 …… w1k_1_t1k_1
^ w21_t21 w22_t22 w23_t23 …… w2k_2_t2k_2
^ wn1_tn1 wn2_tn2 wn3_tn3 …… wnk_n_tnk_n

Inference
[Partial sequence graph: ^ expands to N and V for "people", each of which expands to N and V for "laugh", and so on.]

Transition probability table (row = current tag, column = next tag):
      ^    N    V    O    $
^     0   0.6  0.2  0.2   0
N     0   0.1  0.4  0.3  0.2
V     0   0.3  0.1  0.3  0.3
O     0   0.3  0.2  0.3  0.2
$     1    0    0    0    0

This transition table will change from language to language due to language divergences.

Lexical Probability Table (row = tag, column = word):
      Є       people     laugh     …
^     1         0          0       0
N     0      1×10^-3    1×10^-5    …
V     0      1×10^-6    1×10^-3    …
O     0         0       1×10^-9    …
$     1         0          0       0

Size of this table = #POS tags in tagset × vocabulary size; vocabulary size = #unique words in corpus.

Inference – New Sentence: ^ people laugh $

p( ^ N N $ | ^ people laugh $ ) = (0.6 × 1.0) × (0.1 × 1×10^-3) × (0.2 × 1×10^-5)
i.e., [P(N|^) × P(Є|^)] × [P(N|N) × P(people|N)] × [P($|N) × P(laugh|N)]

[Sequence graph: ^ expands on Є to N and V, which expand again to N, V, … for "people", and so on.]
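The value above is one of 3^2 = 9 candidate tag sequences for this two-word sentence. The brute-force enumeration that the next slide calls exponential can be sketched as follows (toy tables copied from the slides; missing cells are taken as 0, and the helper name is hypothetical):

```python
from itertools import product

trans = {"^": {"N": 0.6, "V": 0.2, "O": 0.2},
         "N": {"N": 0.1, "V": 0.4, "O": 0.3, "$": 0.2},
         "V": {"N": 0.3, "V": 0.1, "O": 0.3, "$": 0.3},
         "O": {"N": 0.3, "V": 0.2, "O": 0.3, "$": 0.2}}
lex = {"N": {"people": 1e-3, "laugh": 1e-5},
       "V": {"people": 1e-6, "laugh": 1e-3},
       "O": {"people": 0.0, "laugh": 1e-9}}

def path_prob(words, tags):
    """Joint probability of ^ t1 ... tn $ with the words: transitions x lexical probabilities."""
    seq = ["^"] + list(tags) + ["$"]
    p = 1.0
    for prev, cur in zip(seq, seq[1:]):
        p *= trans.get(prev, {}).get(cur, 0.0)
    for w, t in zip(words, tags):
        p *= lex[t].get(w, 0.0)
    return p

words = ["people", "laugh"]
for tags in product("NVO", repeat=len(words)):
    print(tags, path_prob(words, tags))
# ('N', 'N') gives 0.6*1e-3 * 0.1*1e-5 * 0.2 = 1.2e-10, matching the slide's value
```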

Computational Complexity

If we have to get the probability of each sequence and then find the maximum among them, we would run into an exponential number of computations.

If |s| = #states (tags + ^ + $) and |o| = length of sentence (#words + ^ + $), then #sequences = |s|^(|o|-2).

But a large number of partial computations can be reused using Dynamic Programming.

Dynamic Programming
[Trellis: on Є, ^ expands to N1, V2, O3; on "people", N1 expands to N4, V5, O, $; and so on for "laugh".]
^ → N1: 0.6 × 1.0 = 0.6;  ^ → V2: 0.2;  ^ → O3: 0.2
Expanding N1 on "people" (P(people|N) = 10^-3):
  0.6 × 0.1 × 10^-3 = 6 × 10^-5
  0.6 × 0.4 × 10^-3 = 2.4 × 10^-4
  0.6 × 0.3 × 10^-3 = 1.8 × 10^-4
  0.6 × 0.2 × 10^-3 = 1.2 × 10^-4
No need to expand N4 and N5 because they will never be a part of the winning sequence.

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.

Now complexity reduces from |s|^|o| to |s|·|o|.

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state. Keep advancing sequences that are
"maximum" amongst all those ending in the same state.

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm
Tree for the sentence "^ People laugh $":
[^ expands on Ԑ to N (0.6), V (0.2), O (0.2); after "People", the N branch expands to N, V, O with scores 0.06×10^-3, 0.24×10^-3, 0.18×10^-3; the V branch with 0.06×10^-6, 0.02×10^-6, 0.06×10^-6; the O branch with 0, 0, 0.]
Claim: We do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass: Will a word always be given the same tag? No. Consider the example:
^ people the city with soldiers $ (i.e., 'populate')
^ quickly people the city $
In the first sentence "people" is most likely to be tagged as noun, whereas in the second, probability mass will shift and "people" will be tagged as verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon
Long tail phenomenon: probability is very low, but not zero, over a large observed sequence.
Language phenomenon: "people", which is predominantly tagged as "Noun", displays a long tail phenomenon; "laugh" is predominantly tagged as "Verb".

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(6×10^-5) (6×10^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B) = P(B|A), if P(A) = P(B)

P(A|B) can mean: Causality (B causes A) or Sequentiality (A follows B)

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here: S = {U1, U2, U3}, V = {R, G, B}
For observation O = o1 … on
and state sequence Q = q1 … qn
π is the initial state distribution: π_i = P(q_1 = U_i)

A (transition probabilities):
       U1   U2   U3
U1    0.1  0.4  0.5
U2    0.6  0.2  0.2
U3    0.3  0.4  0.3

B (emission probabilities):
        R    G    B
U1    0.3  0.5  0.2
U2    0.1  0.4  0.5
U3    0.6  0.1  0.3

92

Observations and states
        O1  O2  O3  O4  O5  O6  O7  O8
OBS:     R   R   G   G   B   R   G   R
State:  S1  S2  S3  S4  S5  S6  S7  S8
Si = U1, U2 or U3 (a particular state); S: state sequence; O: observation sequence; S* = "best" possible state (urn) sequence.
Goal: Maximize P(S|O) by choosing the "best" S.

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

S* = argmax_S ( P(S|O) )

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

P(S|O) = P(S1-8 | O1-8)
       = P(S1|O) · P(S2|S1, O) · P(S3|S2, S1, O) · … · P(S8|S7 … S1, O)

By Markov Assumption (a state depends only on the previous state):

       = P(S1|O) · P(S2|S1, O) · P(S3|S2, O) · … · P(S8|S7, O)

        O1  O2  O3  O4  O5  O6  O7  O8
OBS:     R   R   G   G   B   R   G   R
State:  S1  S2  S3  S4  S5  S6  S7  S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) = P(A) · P(B|A) / P(B)

P(A): Prior; P(B|A): Likelihood

argmax_S P(S|O) = argmax_S P(S) · P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

P(S) = P(S1-8) = P(S1) · P(S2|S1) · P(S3|S2, S1) · P(S4|S3, S2, S1) · … · P(S8|S7 … S1)

By Markov Assumption (k=1):

P(S) = P(S1) · P(S2|S1) · P(S3|S2) · P(S4|S3) · … · P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

P(O|S) = P(O1|S1-8) · P(O2|O1, S1-8) · P(O3|O1-2, S1-8) · … · P(O8|O1-7, S1-8)

Assumption: the ball drawn depends only on the Urn chosen:

P(O|S) = P(O1|S1) · P(O2|S2) · P(O3|S3) · … · P(O8|S8)

Hence
P(S|O) ∝ P(S) · P(O|S)
       = P(S1) · P(S2|S1) · P(S3|S2) · P(S4|S3) · … · P(S8|S7) · P(O1|S1) · P(O2|S2) · P(O3|S3) · … · P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S) = [P(O0|S0)P(S1|S0)] [P(O1|S1)P(S2|S1)] [P(O2|S2)P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)] [P(O8|S8)P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e., P(S9|S8) = 1. O0 is the ε-transition.

         O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:      ε   R   R   G   G   B   R   G   R
State:   S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

[Diagram: the state sequence S0 → S1 → S2 → … → S8 → S9 drawn as a chain, with the emitted symbol (ε, R, R, G, G, B, R, G, R) written on each arc.]

         O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:      ε   R   R   G   G   B   R   G   R
State:   S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

Notation: P(Ok|Sk) · P(Sk+1|Sk) = P(Sk → Sk+1) with output Ok, written P(Sk --Ok--> Sk+1).

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

[Two-state probabilistic FSM over states S1 and S2; each arc is labelled with an (output symbol, probability) pair: (a1, 0.3), (a2, 0.4), (a1, 0.2), (a2, 0.3), (a1, 0.1), (a2, 0.2), (a1, 0.3), (a2, 0.2).]

The question here is: "what is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree
[Tree: from Start, after ε the scores are S1 = 1.0, S2 = 0.0. Reading a1, each state expands to S1 and S2 with arc probabilities 0.1, 0.3 (from S1) and 0.2, 0.3 (from S2), giving 1·0.1 = 0.1, 0.3, 0.0, 0.0. Reading a2 (arc probabilities 0.2, 0.4 from S1 and 0.3, 0.2 from S2), the candidates are 0.1·0.2 = 0.02, 0.1·0.4 = 0.04, 0.3·0.3 = 0.09, 0.3·0.2 = 0.06.]
Choose the winning sequence per state, per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd…
[Continuing the tree: the winners 0.09 (S1) and 0.06 (S2) are extended on the next a1 — candidates 0.09·0.1 = 0.009, 0.027, 0.012, 0.018 — and on the final a2, giving 0.0024 and 0.0081 for S1, and 0.0048 and 0.0054 for S2.]
The problem being addressed by this tree is S* = argmax_S P(S | a1-a2-a1-a2, μ),
where a1-a2-a1-a2 is the output sequence and μ the model or the machine.

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 → S2 → S1 → S2 → S1, reading the outputs a1, a2, a1, a2 along the arcs.

Problem statement: Find the best possible sequence
S* = argmax_S P(S | O, μ)
where S = state sequence, O = output sequence, μ = machine or model.
Machine or model: μ = (S, A, S0, T), where S = state collection, A = alphabet set, S0 = start symbol, T = transitions.
T is defined as P(S_i --a_k--> S_j) for all i, j, k.

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

(rows: ending state; columns: latest symbol observed)
        ε      a1                               a2             a1                a2
S1     1.0    (1.0×0.1, 0.0×0.2) = (0.1, 0.0)   (0.02, 0.09)   (0.009, 0.012)    (0.0024, 0.0081)
S2     0.0    (1.0×0.3, 0.0×0.3) = (0.3, 0.0)   (0.04, 0.06)   (0.027, 0.018)    (0.0048, 0.0054)

Note: Every cell records the winning probability ending in that state (the bold-faced value on the slide). Going backward from the final winner (the 0.0081 entry, whose winning predecessor is S2, indicated by the 2nd tuple element), we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105
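A short sketch that re-creates this table for the two-state machine: the arc probabilities below are those read off the FSM figure earlier, and running it reproduces the cell values and the path S1 S2 S1 S2 S1 found above.

```python
# P(Si --a--> Sj) for the two-state probabilistic FSM (values as read off the figure)
arcs = {("S1", "a1", "S1"): 0.1, ("S1", "a1", "S2"): 0.3,
        ("S1", "a2", "S1"): 0.2, ("S1", "a2", "S2"): 0.4,
        ("S2", "a1", "S1"): 0.2, ("S2", "a1", "S2"): 0.3,
        ("S2", "a2", "S1"): 0.3, ("S2", "a2", "S2"): 0.2}
states, output = ["S1", "S2"], ["a1", "a2", "a1", "a2"]

score = {"S1": 1.0, "S2": 0.0}        # after epsilon we are in S1
back = []                             # per step, winning predecessor of each state
for sym in output:
    new_score, choices = {}, {}
    for j in states:
        cands = {i: score[i] * arcs[(i, sym, j)] for i in states}
        best_i = max(cands, key=cands.get)
        choices[j], new_score[j] = best_i, cands[best_i]
    back.append(choices)
    score = new_score

final = max(score, key=score.get)     # best final score
path = [final]
for choices in reversed(back):
    path.append(choices[path[-1]])
print(score)                          # {'S1': 0.0081, 'S2': 0.0054}
print(list(reversed(path)))           # ['S1', 'S2', 'S1', 'S2', 'S1']
```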

Algorithm (following James Allen, Natural Language Understanding, 2nd edition, Benjamin/Cummings, 1995)

Given:
1. The HMM, which means:
   a. Start state: S1
   b. Alphabet: A = {a1, a2, …, ap}
   c. Set of states: S = {S1, S2, …, Sn}
   d. Transition probability P(S_i --a_k--> S_j), for all i, j, k, which is equal to P(S_j | a_k, S_i)
2. The output string: a1 a2 … aT

To find: The most likely sequence of states C1 C2 … CT which produces the given output sequence, i.e.,
C1 C2 … CT = argmax_C [ P(C | a1 a2 … aT) ]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure:
1. An N×T array called SEQSCORE to maintain the winner sequence always (N = #states, T = length of o/p sequence)
2. Another N×T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For i = 2 to N do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For t = 2 to T do
    For i = 1 to N do
        SEQSCORE(i,t) = Max over j = 1..N of [ SEQSCORE(j, t-1) * P(S_j --a_k--> S_i) ]
        BACKPTR(i,t) = index j that gives the Max above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = i that maximizes SEQSCORE(i,T)
For i from (T-1) to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1×T
2. SEQSCORE can be T×2

Homework: compare this with A* and Beam Search. Reason for this comparison: both of them work for finding and recovering sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 109
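A direct Python transcription of the three steps is sketched below (SEQSCORE/BACKPTR as on the slides; the HMM arcs are passed in as a dictionary P(Sj --a--> Si), with absent arcs defaulting to 0):

```python
def viterbi(states, start_state, arc_prob, output):
    """arc_prob[(Sj, a, Si)] = P(Sj --a--> Si); output = [a1, ..., aT]."""
    N, T = len(states), len(output)
    SEQSCORE = [[0.0] * (T + 1) for _ in range(N)]
    BACKPTR = [[0] * (T + 1) for _ in range(N)]

    # 1. Initialization: all probability mass starts in the start state
    SEQSCORE[states.index(start_state)][0] = 1.0

    # 2. Iteration: extend the best sequence ending in each state
    for t in range(1, T + 1):
        a = output[t - 1]
        for i, Si in enumerate(states):
            cands = [SEQSCORE[j][t - 1] * arc_prob.get((Sj, a, Si), 0.0)
                     for j, Sj in enumerate(states)]
            best_j = max(range(N), key=lambda j: cands[j])
            SEQSCORE[i][t], BACKPTR[i][t] = cands[best_j], best_j

    # 3. Sequence identification: follow the back-pointers from the best final state
    C = [0] * (T + 1)
    C[T] = max(range(N), key=lambda i: SEQSCORE[i][T])
    for t in range(T, 0, -1):
        C[t - 1] = BACKPTR[C[t]][t]
    return [states[i] for i in C], SEQSCORE[C[T]][T]

# E.g., with the two-state FSM arcs sketched earlier:
# viterbi(["S1", "S2"], "S1", arcs, ["a1", "a2", "a1", "a2"])
#   -> (['S1', 'S2', 'S1', 'S2', 'S1'], 0.0081)
```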

Viterbi Algorithm for the Urn problem (first two symbols)

[Trellis/tree for the first two symbols ε and R: S0 expands to U1, U2, U3 with arc probabilities 0.5, 0.3, 0.2 as drawn; each urn node then expands again to U1, U2, U3. The accumulated scores shown at the nodes include 0.03, 0.08, 0.15 after R, and 0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036 at the following level.]

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory works:
P(S)P(O|S) = [P(O0|S0)P(S1|S0)] [P(O1|S1)P(S2|S1,S0)] [P(O2|S2)P(S3|S2,S1)] [P(O3|S3)P(S4|S3,S2)] [P(O4|S4)P(S5|S4,S3)] [P(O5|S5)P(S6|S5,S4)] [P(O6|S6)P(S7|S6,S5)] [P(O7|S7)P(S8|S7,S6)] [P(O8|S8)P(S9|S8,S7)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e., P(S9|S8,S7) = 1. O0 is the ε-transition.

         O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:      ε   R   R   G   G   B   R   G   R
State:   S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence? The language modeling problem:
1. P("The sun rises in the east")
2. P("The sun rise in the east")
   • Less probable because of grammatical mistake
3. P("The svn rises in the east")
   • Less probable because of lexical mistake
4. P("The sun rises in the west")
   • Less probable because of semantic mistake
Probabilities are computed in the context of corpora.

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of a language model:
1. Detect well-formedness
   • Lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   • Given a piece of text, what language does it belong to?
     Good morning - English; Guten Morgen - German; Bonjour - French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0 o1 o2 o3 … om)?

P(O) = Σ_S P(O, S)    (Marginalization)

Consider the observation sequence O0 O1 O2 … Om with an underlying state sequence S0 S1 S2 S3 … Sm Sm+1,
where the Si's represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0 o1 o2 o3 … om)

P(O) = Σ_S P(O, S) = Σ_S P(S) · P(O|S)

P(S0 S1 S2 … Sm+1) · P(O0 O1 O2 … Om | S0 S1 … Sm+1)
   = P(S0) P(S1|S0) P(S2|S1) … P(Sm+1|Sm) · P(O0|S0) P(O1|S1) P(O2|S2) … P(Om|Sm)
   = P(S0) [P(O0|S0) P(S1|S0)] [P(O1|S1) P(S2|S1)] … [P(Om|Sm) P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k, i)

Define F(k, i) = probability of being in state Si having seen o0 o1 o2 … ok:
F(k, i) = P(o0 o1 o2 … ok, Si)
With m as the length of the observed sequence and N states:
P(observed sequence) = P(o0 o1 o2 … om) = Σ_{p=0..N} P(o0 o1 o2 … om, Sp) = Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd.)
F(k, q) = P(o0 o1 o2 … ok, Sq)
        = P(o0 o1 o2 … ok-1, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 … ok-1, Sp, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 … ok-1, Sp) · P(ok, Sq | o0 o1 o2 … ok-1, Sp)
        = Σ_{p=0..N} F(k-1, p) · P(ok, Sq | Sp)
        = Σ_{p=0..N} F(k-1, p) · P(Sp --ok--> Sq)

Obs:    O0  O1  O2  O3  …  Ok  Ok+1  …  Om-1  Om
State:  S0  S1  S2  S3  …  Sp  Sq    …  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119
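The recurrence translates almost line for line into code. A sketch, using the same hypothetical arc-probability dictionary convention as in the Viterbi sketch:

```python
def forward(states, start_state, arc_prob, output):
    """F[k][q] = P(o0 ... ok, q); o0 is the epsilon read in the start state."""
    F = [{q: (1.0 if q == start_state else 0.0) for q in states}]
    for ok in output:                          # o1 ... om
        prev = F[-1]
        F.append({q: sum(prev[p] * arc_prob.get((p, ok, q), 0.0) for p in states)
                  for q in states})
    return F

# P(observation sequence) = sum over states of the last column, e.g.:
# F = forward(states, "^", arc_prob, obs); P_O = sum(F[-1].values())
```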

Backward probability B(k, i)

Define B(k, i) = probability of seeing ok ok+1 ok+2 … om given that the state was Si:
B(k, i) = P(ok ok+1 ok+2 … om | Si)
With m as the length of the whole observed sequence:
P(observed sequence) = P(o0 o1 o2 … om) = P(o0 o1 o2 … om | S0) = B(0, 0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd.)
B(k, p) = P(ok ok+1 ok+2 … om | Sp)
        = P(ok+1 ok+2 … om, ok | Sp)
        = Σ_{q=0..N} P(ok+1 ok+2 … om, ok, Sq | Sp)
        = Σ_{q=0..N} P(ok, Sq | Sp) · P(ok+1 ok+2 … om | ok, Sq, Sp)
        = Σ_{q=0..N} P(ok+1 ok+2 … om | Sq) · P(ok, Sq | Sp)
        = Σ_{q=0..N} B(k+1, q) · P(Sp --ok--> Sq)

Obs:    O0  O1  O2  O3  …  Ok  Ok+1  …  Om-1  Om
State:  S0  S1  S2  S3  …  Sp  Sq    …  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121
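For symmetry, the backward recurrence can be sketched the same way (this only mechanises the recursion already given above; the exercise later still asks for the full derivation and complexity analysis):

```python
def backward(states, arc_prob, output):
    """B[k][p] = P(ok ... om | p), filled from right to left."""
    B = [{p: 1.0 for p in states}]                       # empty suffix
    for ok in reversed(output):
        nxt = B[0]
        B.insert(0, {p: sum(arc_prob.get((p, ok, q), 0.0) * nxt[q] for q in states)
                     for p in states})
    return B

# P(O) can then be read off as B[0][start_state], i.e. B(0, 0) on the slide.
```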

How Forward Probability Works

Goal of Forward Probability: to find P(O) [the probability of the Observation Sequence].
E.g., ^ People laugh $
[Trellis: ^ at the start, a column {N, V} for "People", a column {N, V} for "Laugh", and $ at the end.]

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition table (row = current tag, column = next tag):
      ^    N    V    $
^     0   0.7  0.3   0
N     0   0.2  0.6  0.2
V     0   0.6  0.2  0.2
$     1    0    0    0

Lexical table (row = tag, column = word):
      ε    People  Laugh
^     1      0       0
N     0     0.8     0.2
V     0     0.1     0.9
$     1      0       0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all state sequences S of ∏_i [ P(O_i | S_i) · P(S_i → S_i+1) ]

Computation in the various paths of the tree (symbols ε, People, Laugh):

Path 1: ^ N N $ :  P(Path1) = (1.0×0.7) × (0.8×0.2) × (0.2×0.2)
Path 2: ^ N V $ :  P(Path2) = (1.0×0.7) × (0.8×0.6) × (0.9×0.2)
Path 3: ^ V N $ :  P(Path3) = (1.0×0.3) × (0.1×0.6) × (0.2×0.2)
Path 4: ^ V V $ :  P(Path4) = (1.0×0.3) × (0.1×0.2) × (0.9×0.2)

[Tree: ^ branches on ε to N and V for "People"; each branches again to N and V for "Laugh", then to $.]

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis: F = accumulated F × output probability × transition probability

F(N,1) = 0.7 × 1.0
F(V,1) = 0.3 × 1.0
F(N,2) = F(N,1) × (0.2 × 0.8) + F(V,1) × (0.6 × 0.1)
F(V,2) = F(N,1) × (0.6 × 0.8) + F(V,1) × (0.2 × 0.1)
F($)   = F(N,2) × (0.2 × 0.2) + F(V,2) × (0.2 × 0.9)

[Trellis: ^ → {N, V} on ε/People → {N, V} on Laugh → $.]

21 July 2014Pushpak Bhattacharyya Intro

POS 125
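These trellis values are easy to verify numerically; a small sketch with the toy tables of the "Transition and Lexical Probability Tables" slide (ε is what '^' emits):

```python
T = {"^": {"N": 0.7, "V": 0.3},
     "N": {"N": 0.2, "V": 0.6, "$": 0.2},
     "V": {"N": 0.6, "V": 0.2, "$": 0.2}}          # P(next tag | tag)
L = {"N": {"People": 0.8, "Laugh": 0.2},
     "V": {"People": 0.1, "Laugh": 0.9}}           # P(word | tag)

F_N1 = 0.7 * 1.0                                    # ^ -> N, emitting epsilon at ^
F_V1 = 0.3 * 1.0
F_N2 = F_N1 * (T["N"]["N"] * L["N"]["People"]) + F_V1 * (T["V"]["N"] * L["V"]["People"])
F_V2 = F_N1 * (T["N"]["V"] * L["N"]["People"]) + F_V1 * (T["V"]["V"] * L["V"]["People"])
F_end = F_N2 * (T["N"]["$"] * L["N"]["Laugh"]) + F_V2 * (T["V"]["$"] * L["V"]["Laugh"])
print(F_N2, F_V2, F_end)    # 0.13, 0.342, and P(^ People Laugh $) = 0.06676
```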

Number of Multiplications
Tree: each path has 5 multiplications; there are 4 paths in the tree; therefore a total of 20 multiplications and 3 additions.
Trellis: F(N,1) → 1 multiplication; F(V,1) → 1 multiplication; F(N,2) = F(N,1) × (1 mult) + F(V,1) × (1 mult) = 4 multiplications + 1 addition. Similarly for F(V,2) and F($): 4 multiplications and 1 addition each.
So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity
Let |S| = #states and |O| = observation length (excluding '^' and '$').
Stage 1 of the trellis: |S| multiplications.
Stage 2 of the trellis: |S| nodes; each node needs computation over |S| arcs; each arc = 1 multiplication; accumulated F = 1 more multiplication; total 2|S|^2 multiplications.
Same for each stage before reading '$'. At the final stage ('$') → 2|S| multiplications.
Therefore total multiplications = |S| + 2|S|^2 (|O| - 1) + 2|S|.

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values multiplied by P(S_i → S_j).
3. Complexity = |S| + 2|S|^2 (|O| - 1) + 2|S|
              = 2|S|^2 |O| - 2|S|^2 + 3|S|
              = O(|S|^2 |O|),
i.e., linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward Probability: (a) derive the Backward Algorithm; (b) compute its complexity.
2. Express P(O) in terms of both Forward and Backward probability.

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble: auto-completion of words (human vs. machine)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Allied DisciplinesPhilosophy Semantics Meaning of ldquomeaningrdquo Logic

(syllogism)Linguistics Study of Syntax Lexicon Lexical Semantics etc

Probability and Statistics Corpus Linguistics Testing of Hypotheses System Evaluation

Cognitive Science Computational Models of Language Processing Language Acquisition

Psychology Behavioristic insights into Language Processing Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory Entropy Random Fields

Computer Sc amp Engg Systems for NLP

21 July 2014Pushpak Bhattacharyya Intro

POS 48

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

infection declension derivation etc ) Rule based Vs Statistical NLU comparison with POS tagging as case study Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

ldquo_ldquo The_DT mechanisms_NNS that_WDTmake_VBP traditional_JJ hardware_NNare_VBP really_RB being_VBGobsoleted_VBN by_IN microprocessor-based_JJ machines_NNS _ rdquo_rdquo said_VBDMr_NNP Benton_NNP _

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Day wise schedule (14)

Day-1 Introduction NLP as playground for rule based and statistical techniques Before break Complete NLP architecture Ambiguity start of

POS tagging After Break NLTK (open source python based framework of

comprehensive NLP tools) POS tagging assignment

Day-2 Shallow parsing Before break Morph analysis and synthesis (segmentation

infection declension derivation etc ) Rule based Vs Statistical NLU comparison with POS tagging as case study Hidden Markov Model and Viterbi algorithm

After break POS tagging assignment continued

21 July 2014Pushpak Bhattacharyya Intro

POS 49

Day wise schedule (24)

Day-3 Syntactic Parsing Before break Parsing- classical and statistical theory and

techniques After break Hands on with probabilistic parser

Day-4 Semantics Before break Rule based NLU case study of semantic graph

generation through Universal Networking Language (UNL) After break continue POS tagging and Parsing assignments

21 July 2014Pushpak Bhattacharyya Intro

POS 50

Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

ldquo_ldquo The_DT mechanisms_NNS that_WDTmake_VBP traditional_JJ hardware_NNare_VBP really_RB being_VBGobsoleted_VBN by_IN microprocessor-based_JJ machines_NNS _ rdquo_rdquo said_VBDMr_NNP Benton_NNP _

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70
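A confusion matrix and the per-POS accuracy figures of the previous two slides can be computed with a few lines of Python; this is a generic sketch (my own illustration, assuming aligned gold and predicted tag lists), not the code behind the screenshot.

  from collections import Counter, defaultdict

  def confusion_and_accuracy(gold_tags, pred_tags):
      # gold_tags, pred_tags: flat, aligned lists of tags
      confusion = defaultdict(Counter)          # confusion[gold][pred] = count
      for g, p in zip(gold_tags, pred_tags):
          confusion[g][p] += 1
      per_pos_accuracy = {g: row[g] / sum(row.values()) for g, row in confusion.items()}
      return confusion, per_pos_accuracy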

HMM

[Figure: the NLP Trinity diagram - the Problem axis (Morph Analysis, Part of Speech Tagging, Parsing, Semantics), the Language axis (Hindi, Marathi, English, French) and the Algorithm axis (HMM, CRF, MEMM) - with HMM as the algorithm in focus.]

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1: # of Red = 30, # of Green = 50, # of Blue = 20
Urn 2: # of Red = 10, # of Green = 40, # of Blue = 50
Urn 3: # of Red = 60, # of Green = 10, # of Blue = 30

Colored ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd.)

Given:

Transition probability table
        U1    U2    U3
  U1   0.1   0.4   0.5
  U2   0.6   0.2   0.2
  U3   0.3   0.4   0.3

Emission probability table
         R     G     B
  U1   0.3   0.5   0.2
  U2   0.1   0.4   0.5
  U3   0.6   0.1   0.3

Observation: RRGGBRGR
State Sequence: ??? (not so easily computable)

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Diagrammatic representation (1/2)

[Figure: the urns U1, U2, U3 drawn as states of a machine; the transition probabilities from the table above (0.1, 0.4, 0.5 out of U1; 0.6, 0.2, 0.2 out of U2; 0.3, 0.4, 0.3 out of U3) label the arcs, and each state carries its emission probabilities (U1: R 0.3, G 0.5, B 0.2; U2: R 0.1, G 0.4, B 0.5; U3: R 0.6, G 0.1, B 0.3).]

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (2/2)

[Figure: the same machine, with the emission and transition probabilities multiplied together on each arc; for example, the arc entering U2 with transition probability 0.6 carries R 0.06, G 0.24, B 0.30 (= 0.6 x U2's emission probabilities), and other arcs carry products such as R 0.24, G 0.04, B 0.12 and R 0.02, G 0.08, B 0.10.]

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM

1. Given the observation sequence, find the most likely state sequence - Viterbi algorithm
2. Given the observation sequence, find its probability - forward/backward algorithm
3. Given the observation sequence, find the HMM parameters - Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi

The "start" and "end" markers are important in a sequence. Subtrees get eliminated due to the Markov Assumption.

POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), . (end) [start & end states]

Illustration of Viterbi - Lexicon

people: N, V
laugh: N, V

Corpora for Training
^ w11_t11 w12_t12 w13_t13 ... w1k_1_t1k_1
^ w21_t21 w22_t22 w23_t23 ... w2k_2_t2k_2
^ wn1_tn1 wn2_tn2 wn3_tn3 ... wnk_n_tnk_n

Inference

[Figure: partial sequence graph - from ^, the candidate tags N and V branch out for 'people', and from each of those, N and V branch out again for 'laugh'.]

Transition Probability Table
        ^     N     V     O     .
  ^     0    0.6   0.2   0.2    0
  N     0    0.1   0.4   0.3   0.2
  V     0    0.3   0.1   0.3   0.3
  O     0    0.3   0.2   0.3   0.2
  .     1     0     0     0     0

This transition table will change from language to language, due to language divergences.

Lexical Probability Table
        ε         people      laugh      ...
  ^     1           0           0        ...
  N     0        1x10^-3     1x10^-5     ...
  V     0        1x10^-6     1x10^-3     ...
  O     0           0        1x10^-9     ...
  .     1           0           0        ...

Size of this table = #POS tags in the tagset x vocabulary size, where vocabulary size = #unique words in the corpus.

Inference - New Sentence

^ people laugh .

p( ^ N N . | ^ people laugh . ) = (0.6 x 1.0) x (0.1 x 1x10^-3) x (0.2 x 1x10^-5)

[Figure: the corresponding path ^ -> N -> N -> . in the partial sequence graph, with ε emitted at the start marker.]
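As a sanity check, this product can be computed mechanically from the two tables above; a small sketch (my own illustration, not course code) that walks the tag sequence ^ N N . over the sentence, using the convention that the arc out of a state carries that state's emission times the transition probability:

  trans = {('^', 'N'): 0.6, ('N', 'N'): 0.1, ('N', '.'): 0.2}             # from the transition table
  lex = {('^', 'ε'): 1.0, ('N', 'people'): 1e-3, ('N', 'laugh'): 1e-5}    # from the lexical table

  def path_probability(obs, states):
      # obs[k] is emitted at states[k]; the arc to states[k+1] carries
      # P(obs[k] | states[k]) * P(states[k+1] | states[k])
      p = 1.0
      for k in range(len(states) - 1):
          p *= lex[(states[k], obs[k])] * trans[(states[k], states[k + 1])]
      return p

  print(path_probability(['ε', 'people', 'laugh'], ['^', 'N', 'N', '.']))
  # (1.0 x 0.6) x (1e-3 x 0.1) x (1e-5 x 0.2) = 1.2e-10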

Computational Complexity

If we have to get the probability of each sequence and then find the maximum among them, we would run into an exponential number of computations.

If |s| = #states (tags + ^ + .) and |o| = length of the sentence (#words + ^ + .), then #sequences = |s|^(|o|-2)

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming

[Figure: partial expansion of the tag tree for "^ people laugh .". From ^ (on reading ε) the nodes N1, V2, O3 are created with scores 0.6 x 1.0 = 0.6, 0.2 and 0.2. Expanding N1 on reading 'people' gives successors with scores
  0.6 x 0.1 x 10^-3 = 6 x 10^-5
  0.6 x 0.4 x 10^-3 = 2.4 x 10^-4
  0.6 x 0.3 x 10^-3 = 1.8 x 10^-4
  0.6 x 0.2 x 10^-3 = 1.2 x 10^-4
while the remaining N-labelled nodes N4 and N5 belong to the unexpanded branches.]

No need to expand N4 and N5, because they will never be a part of the winning sequence.

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.

Now complexity reduces from |s|^|o| to |s| x |o|.

Here we followed the Markov assumption of order 1
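To get a feel for the reduction (my own worked numbers, not from the slides): with |s| = 5 states (^, N, V, O, .) and |o| = 10, enumerating all tag sequences costs on the order of 5^8 = 390,625 path evaluations, whereas the dynamic-programming trellis touches only about |s| x |o| = 50 nodes, each doing at most |s| comparisons.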

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state. Keep advancing sequences that are "maximum" amongst all those ending in the same state.

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm

[Figure: tree for the sentence "^ People laugh .". From ^ (on reading ε) the nodes N (0.6), V (0.2), O (0.2) are created; on reading 'People' each expands into N, V, O children with scores
  from N: (0.06 x 10^-3), (0.24 x 10^-3), (0.18 x 10^-3)
  from V: (0.06 x 10^-6), (0.02 x 10^-6), (0.06 x 10^-6)
  from O: (0), (0), (0) ]

Claim: We do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass

Will a word always be given the same tag? No. Consider the examples:

^ people the city with soldiers (i.e. 'populate')
^ quickly people the city

In the first sentence "people" is most likely to be tagged as a noun, whereas in the second the probability mass will shift and "people" will be tagged as a verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon

Long tail phenomenon: the probability is very low, but not zero, over a large observed sequence.

Language phenomenon: "people", which is predominantly tagged as "Noun", displays a long tail phenomenon; "laugh" is predominantly tagged as "Verb".

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

[Figure: two N nodes at the same level of the tree, N1 with score 6 x 10^-5 and N2 with score 6 x 10^-8, each about to expand into N, V, O children on reading 'LAUGH'.]

Next step: all the probabilities will be multiplied by identical factors (lexical and transition). So the children of N2 will have probability less than the children of N1.

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B) = P(B|A), if P(A) = P(B)

P(A|B) can mean:
  Causality: B causes A
  Sequentiality: A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here: S = {U1, U2, U3}, V = {R, G, B}

For observation O = o1, ..., on
and state sequence Q = q1, ..., qn

π_i = P(q1 = Ui)

A = transition probability matrix
        U1    U2    U3
  U1   0.1   0.4   0.5
  U2   0.6   0.2   0.2
  U3   0.3   0.4   0.3

B = emission probability matrix
         R     G     B
  U1   0.3   0.5   0.2
  U2   0.1   0.4   0.5
  U3   0.6   0.1   0.3

92
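These parameters can be written down directly as data; a short NumPy sketch (my own illustration; the initial distribution π is not given numerically on the slide, so a uniform π is assumed here purely for illustration):

  import numpy as np

  states = ['U1', 'U2', 'U3']
  symbols = ['R', 'G', 'B']

  # A[i][j] = P(next state = states[j] | current state = states[i])
  A = np.array([[0.1, 0.4, 0.5],
                [0.6, 0.2, 0.2],
                [0.3, 0.4, 0.3]])

  # B[i][k] = P(symbol = symbols[k] | state = states[i])
  B = np.array([[0.3, 0.5, 0.2],
                [0.1, 0.4, 0.5],
                [0.6, 0.1, 0.3]])

  # pi[i] = P(q1 = states[i]); uniform start is an assumption, not from the slide
  pi = np.array([1/3, 1/3, 1/3])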

Observations and states

          O1  O2  O3  O4  O5  O6  O7  O8
OBS:       R   R   G   G   B   R   G   R
State:    S1  S2  S3  S4  S5  S6  S7  S8

Si = U1, U2 or U3; a particular state
S: state sequence
O: observation sequence
S* = "best" possible state (urn) sequence
Goal: maximize P(S*|O) by choosing the "best" S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

S* = argmax_S P(S|O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

P(S1-8 | O1-8) = P(S1 | O1-8) P(S2 | S1, O1-8) P(S3 | S2 S1, O1-8) ... P(S8 | S7 ... S1, O1-8)

By Markov Assumption (a state depends only on the previous state):

P(S1-8 | O1-8) = P(S1 | O1-8) P(S2 | S1, O1-8) P(S3 | S2, O1-8) ... P(S8 | S7, O1-8)

          O1  O2  O3  O4  O5  O6  O7  O8
OBS:       R   R   G   G   B   R   G   R
State:    S1  S2  S3  S4  S5  S6  S7  S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) = P(A) P(B|A) / P(B)

P(A) - Prior
P(B|A) - Likelihood

argmax_S P(S|O) = argmax_S P(S) P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

P(S) = P(S1-8) = P(S1) P(S2|S1) P(S3|S2,S1) P(S4|S3,S2,S1) ... P(S8|S7 ... S1)

By Markov Assumption (k=1):

P(S1-8) = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) ... P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

P(O|S) = P(O1 | S1-8) P(O2 | O1, S1-8) P(O3 | O2 O1, S1-8) ... P(O8 | O7 ... O1, S1-8)

Assumption: the ball drawn depends only on the Urn chosen.

P(O|S) = P(O1|S1) P(O2|S2) P(O3|S3) ... P(O8|S8)

Putting the two together:

P(S|O) ∝ P(S) P(O|S)
       = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) ... P(S8|S7) x P(O1|S1) P(O2|S2) P(O3|S3) ... P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S) P(O|S) = [P(O0|S0) P(S1|S0)]
              [P(O1|S1) P(S2|S1)] [P(O2|S2) P(S3|S2)] [P(O3|S3) P(S4|S3)]
              [P(O4|S4) P(S5|S4)] [P(O5|S5) P(S6|S5)] [P(O6|S6) P(S7|S6)]
              [P(O7|S7) P(S8|S7)] [P(O8|S8) P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8, the next state is S9 with probability 1, i.e. P(S9|S8) = 1.
O0 is the ε-transition.

          O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:       ε   R   R   G   G   B   R   G   R
State:    S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99
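Using the factorisation of the last few slides (P(S1) Π P(Sk|Sk-1) x Π P(Ok|Sk)), P(S) P(O|S) for any candidate urn sequence can be scored directly from the A, B and pi arrays of the earlier sketch; the candidate sequence below is arbitrary and purely illustrative.

  def score_state_sequence(obs, seq, A, B, pi, states=('U1', 'U2', 'U3'), symbols=('R', 'G', 'B')):
      s = {u: i for i, u in enumerate(states)}
      v = {c: i for i, c in enumerate(symbols)}
      p = pi[s[seq[0]]] * B[s[seq[0]], v[obs[0]]]                      # P(S1) P(O1|S1)
      for k in range(1, len(obs)):
          p *= A[s[seq[k - 1]], s[seq[k]]] * B[s[seq[k]], v[obs[k]]]   # P(Sk|Sk-1) P(Ok|Sk)
      return p

  obs = list("RRGGBRGR")
  print(score_state_sequence(obs, ['U3', 'U1', 'U1', 'U2', 'U2', 'U3', 'U2', 'U3'], A, B, pi))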

Introducing useful notation

          O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:       ε   R   R   G   G   B   R   G   R
State:    S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

[Figure: the states S0 ... S9 drawn as a chain, with the observation written on each arc: ε on S0 -> S1, then R, R, G, G, B, R, G, R on the following arcs.]

P(Ok|Sk) P(Sk+1|Sk) = P(Sk -> Sk+1) with output Ok, written compactly as P(Sk Sk+1)_Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

[Figure: a two-state probabilistic FSM over states S1 and S2; each arc (self-loops and cross arcs) is labelled with an (output, probability) pair such as (a1, 0.3), (a2, 0.4), (a1, 0.2), (a2, 0.3), (a1, 0.1), (a2, 0.2), (a1, 0.3), (a2, 0.2).]

The question here is: "what is the most likely state sequence, given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree

[Figure: the search tree. Start: S1 with probability 1.0, S2 with 0.0. On reading a1, S1's children get 1.0 x 0.1 = 0.1 (into S1) and 0.3 (into S2), while S2's children get 0.0. On reading a2, the nodes expand with arc probabilities 0.2, 0.4, 0.3, 0.2, giving 0.1 x 0.2 = 0.02, 0.1 x 0.4 = 0.04, 0.3 x 0.3 = 0.09, 0.3 x 0.2 = 0.06.]

Choose the winning sequence per state, per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd...

[Figure: continuing from the per-state winners 0.09 (in S1) and 0.06 (in S2). On reading a1 they expand with arc probabilities 0.1, 0.3, 0.2, 0.3 to give 0.09 x 0.1 = 0.009, 0.027, 0.018, 0.012; on reading a2 the surviving nodes expand to the leaves 0.0081 (into S1, arc 0.3), 0.0054 (into S2, arc 0.2), 0.0048 (into S2, arc 0.4) and 0.0024 (into S1, arc 0.2).]

The problem being addressed by this tree is: S* = argmax_S P(S | a1-a2-a1-a2, µ)

where a1-a2-a1-a2 is the output sequence and µ the model, or the machine.

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

[Figure: the recovered path through the states S1, S2, S1, S2, S1, with the outputs a1, a2, a1, a2 on its arcs.]

Problem statement: Find the best possible sequence

S* = argmax_S P(S | O, µ)

where S = state sequence, O = output sequence, and µ = the machine or model.

µ = (S, A, T, S0): state collection, alphabet set, transitions, start symbol.

T is defined as P(Si --ak--> Sj), the probability of the transition from Si to Sj that outputs ak.

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

          ε        a1                      a2               a1                 a2
  S1     1.0    (1.0x0.1, 0.0x0.2)      (0.02, 0.09)     (0.009, 0.012)     (0.0024, 0.0081)
                = (0.1, 0.0)
  S2     0.0    (1.0x0.3, 0.0x0.3)      (0.04, 0.06)     (0.027, 0.018)     (0.0048, 0.0054)
                = (0.3, 0.0)

Rows: ending state. Columns: latest symbol observed.

Note: every cell records the winning probability ending in that state. The bold-faced value in each cell is the sequence probability ending in that state. Going backward from the final winner (which ends in state S2, indicated by the 2nd tuple in the last column), we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm (following James Allen, Natural Language Understanding (2nd edition), Benjamin/Cummings (pub.), 1995)

Given:
1. The HMM, which means:
   a. Start State: S1
   b. Alphabet: A = {a1, a2, ..., ap}
   c. Set of States: S = {S1, S2, ..., Sn}
   d. Transition probability P(Si --ak--> Sj), which is equal to P(Sj, ak | Si)
2. The output string: a1 a2 ... aT

To find: the most likely sequence of states C1 C2 ... CT which produces the given output sequence, i.e.

C1 C2 ... CT = argmax_C [ P(C | a1 a2 ... aT) ]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contd...

Data Structures:
1. An N x T array called SEQSCORE, to maintain the winner sequence always (N = #states, T = length of the output sequence)
2. Another N x T array called BACKPTR, to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

   SEQSCORE(1,1) = 1.0
   BACKPTR(1,1) = 0
   For(i = 2 to N) do
       SEQSCORE(i,1) = 0.0
   [expressing the fact that the first state is S1]

2. Iteration

   For(t = 2 to T) do
       For(i = 1 to N) do
           SEQSCORE(i,t) = Max(j = 1..N) [ SEQSCORE(j, (t-1)) * P(Sj --ak--> Si) ]
           BACKPTR(i,t) = index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

   C(T) = i that maximizes SEQSCORE(i,T)
   For i from (T-1) to 1 do
       C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1 x T
2. SEQSCORE can be T x 2

Homework: compare this with A* and Beam Search. Reason for this comparison: both of them work for finding and recovering a sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 109
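The three steps translate almost line-for-line into Python. This is a hedged sketch of the SEQSCORE/BACKPTR formulation (mine, not the course's reference implementation); the probability of an arc is passed in as a function arc_prob(j, i, a) = P(Sj --a--> Si), so the sketch does not commit to where the emission probability is attached.

  def viterbi(output, states, arc_prob, start=0):
      # output: symbols a_1..a_T; states: list of state names; start: index of the start state
      N, T = len(states), len(output)
      SEQSCORE = [[0.0] * (T + 1) for _ in range(N)]
      BACKPTR = [[0] * (T + 1) for _ in range(N)]
      SEQSCORE[start][0] = 1.0                                   # initialization
      for t in range(1, T + 1):                                  # iteration
          a = output[t - 1]
          for i in range(N):
              best_j = max(range(N), key=lambda j: SEQSCORE[j][t - 1] * arc_prob(j, i, a))
              SEQSCORE[i][t] = SEQSCORE[best_j][t - 1] * arc_prob(best_j, i, a)
              BACKPTR[i][t] = best_j
      path = [max(range(N), key=lambda i: SEQSCORE[i][T])]       # sequence identification
      for t in range(T, 1, -1):
          path.append(BACKPTR[path[-1]][t])
      return [states[i] for i in reversed(path)]

For the urn HMM, for instance, arc_prob could be taken as lambda j, i, a: A[j][i] * B[j][symbols.index(a)], matching the P(Ok|Sk) P(Sk+1|Sk) grouping used earlier (an assumption about the convention, not a prescription).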

Viterbi Algorithm for the Urn problem (first two symbols)

[Figure: the first two levels of the Viterbi tree for the urn problem. The root S0 branches on ε to U1, U2, U3 (arc labels 0.5, 0.3, 0.2); after reading R the three nodes carry the scores 0.03, 0.08, 0.15, and expanding two of them gives the eight scores 0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036 (with values such as 0.06, 0.02, 0.02, 0.18, 0.24, 0.18 on the intervening arcs).]

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of order > 1 (say 2)

Same theory works:

P(S) P(O|S) = P(O0|S0) P(S1|S0)
              [P(O1|S1) P(S2|S1,S0)] [P(O2|S2) P(S3|S2,S1)] [P(O3|S3) P(S4|S3,S2)]
              [P(O4|S4) P(S5|S4,S3)] [P(O5|S5) P(S6|S5,S4)] [P(O6|S6) P(S7|S6,S5)]
              [P(O7|S7) P(S8|S7,S6)] [P(O8|S8) P(S9|S8,S7)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8, the next state is S9 with probability 1, i.e. P(S9|S8,S7) = 1.
O0 is the ε-transition.

          O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:       ε   R   R   G   G   B   R   G   R
State:    S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of an observation sequence? The language modeling problem.

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   - less probable because of a grammatical mistake
3. P("The svn rises in the east")
   - less probable because of a lexical mistake
4. P("The sun rises in the west")
   - less probable because of a semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of a language model

1. Detect well-formedness
   - lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   - given a piece of text, what language does it belong to?
     Good morning - English
     Guten Morgen - German
     Bonjour - French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0, o1, o2, o3, ..., om)?

P(O) = Σ_S P(O, S)    [Marginalization]

Consider the observation sequence

  O0  O1  O2  ...  Om
  S0  S1  S2  ...  Sm  Sm+1

where the Si's represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0, o1, o2, o3, ..., om)

P(O) = Σ_S P(S) P(O|S)

P(S) P(O|S) = P(S0, S1, ..., Sm+1) P(O0, O1, ..., Om | S0, S1, ..., Sm+1)
            = P(S0) P(S1|S0) P(S2|S1) ... P(Sm+1|Sm) x P(O0|S0) P(O1|S1) ... P(Om|Sm)
            = P(S0) [P(O0|S0) P(S1|S0)] [P(O1|S1) P(S2|S1)] ... [P(Om|Sm) P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(k,i) = probability of being in state Si, having seen o0, o1, o2, ..., ok

F(k,i) = P(o0, o1, o2, ..., ok, Si)

With m as the length of the observed sequence and N states:

P(observed sequence) = P(o0, o1, o2, ..., om)
                     = Σ_{p=0..N} P(o0, o1, o2, ..., om, Sp)
                     = Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd.)

F(k, q) = P(o0, o1, ..., ok, Sq)
        = P(o0, o1, ..., ok-1, ok, Sq)
        = Σ_{p=0..N} P(o0, o1, ..., ok-1, Sp, ok, Sq)
        = Σ_{p=0..N} P(o0, o1, ..., ok-1, Sp) P(ok, Sq | o0, o1, ..., ok-1, Sp)
        = Σ_{p=0..N} F(k-1, p) P(ok, Sq | Sp)
        = Σ_{p=0..N} F(k-1, p) P(Sp -> Sq)_ok

  O0  O1  O2  O3  ...  Ok  Ok+1  ...  Om-1  Om
  S0  S1  S2  S3  ...  Sp  Sq    ...  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119
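The recurrence is a short dynamic program; a minimal sketch (my own, with arc_prob(p, q, o) standing for P(Sp -> Sq) with output o, the quantity used in the recurrence):

  def forward(output, n_states, arc_prob, start=0):
      # F[k][q] = P(o0..ok, Sq); output[0] is the ε-transition out of the start state
      F = [[0.0] * n_states for _ in range(len(output))]
      F[0][start] = 1.0
      for k in range(1, len(output)):
          for q in range(n_states):
              F[k][q] = sum(F[k - 1][p] * arc_prob(p, q, output[k]) for p in range(n_states))
      return F          # P(observed sequence) = sum(F[-1])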

Backward probability B(ki)

Define B(k,i) = probability of seeing ok, ok+1, ok+2, ..., om, given that the state was Si

B(k,i) = P(ok, ok+1, ok+2, ..., om | Si)

With m as the length of the whole observed sequence:

P(observed sequence) = P(o0, o1, o2, ..., om)
                     = P(o0, o1, o2, ..., om | S0)
                     = B(0, 0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd.)

B(k, p) = P(ok, ok+1, ok+2, ..., om | Sp)
        = P(ok+1, ok+2, ..., om, ok | Sp)
        = Σ_{q=0..N} P(ok+1, ok+2, ..., om, ok, Sq | Sp)
        = Σ_{q=0..N} P(ok, Sq | Sp) P(ok+1, ok+2, ..., om | ok, Sq, Sp)
        = Σ_{q=0..N} P(ok+1, ok+2, ..., om | Sq) P(ok, Sq | Sp)
        = Σ_{q=0..N} B(k+1, q) P(Sp -> Sq)_ok

  O0  O1  O2  O3  ...  Ok  Ok+1  ...  Om-1  Om
  S0  S1  S2  S3  ...  Sp  Sq    ...  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121
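For symmetry with the forward sketch, the backward recurrence can be coded the same way (again only an illustration; deriving the algorithm and its complexity is the exercise at the end of this part).

  def backward(output, n_states, arc_prob):
      # B[k][p] = P(ok..om | Sp), filled right to left; B at position m+1 is 1 (empty suffix)
      m = len(output) - 1
      B = [[0.0] * n_states for _ in range(len(output) + 1)]
      B[m + 1] = [1.0] * n_states
      for k in range(m, -1, -1):
          for p in range(n_states):
              B[k][p] = sum(arc_prob(p, q, output[k]) * B[k + 1][q] for q in range(n_states))
      return B          # P(observed sequence) = B[0][start]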

How Forward Probability Works

Goal of Forward Probability: to find P(O) [the probability of the Observation Sequence]

E.g. ^ People laugh .

[Figure: trellis skeleton - ^ at the start, then a column with N and V for 'People' and another with N and V for 'Laugh'.]

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probability table
        ^     N     V     .
  ^     0    0.7   0.3    0
  N     0    0.2   0.6   0.2
  V     0    0.6   0.2   0.2
  .     1     0     0     0

Lexical probability table
         ε    People    Laugh
  ^      1      0         0
  N      0     0.8       0.2
  V      0     0.1       0.9
  .      1      0         0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all state sequences of [ Π P(Oi|Si) P(Sj|Si) ]

Computation in various paths of the Tree (observations: ε, People, Laugh)

Path 1: ^ N N    P(Path1) = (1.0 x 0.7) x (0.8 x 0.2) x (0.2 x 0.2)
Path 2: ^ N V    P(Path2) = (1.0 x 0.7) x (0.8 x 0.6) x (0.9 x 0.2)
Path 3: ^ V N    P(Path3) = (1.0 x 0.3) x (0.1 x 0.6) x (0.2 x 0.2)
Path 4: ^ V V    P(Path4) = (1.0 x 0.3) x (0.1 x 0.2) x (0.9 x 0.2)

[Figure: the four paths drawn as a tree from ^ through {N, V} for 'People' and {N, V} for 'Laugh'.]

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis

F = accumulated F x output probability x transition probability

F_N1 = 1.0 x 0.7
F_V1 = 1.0 x 0.3
F_N2 = F_N1 x (0.2 x 0.8) + F_V1 x (0.6 x 0.1)
F_V2 = F_N1 x (0.6 x 0.8) + F_V1 x (0.2 x 0.1)
F_.  = F_N2 x (0.2 x 0.2) + F_V2 x (0.2 x 0.9)

[Figure: the trellis ^ -> {N1, V1} -> {N2, V2} -> . for "ε People Laugh", with each F value attached to its node. N1, V1 are the nodes of the 'People' column and N2, V2 those of the 'Laugh' column.]

21 July 2014Pushpak Bhattacharyya Intro

POS 125
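These numbers are easy to check mechanically. The sketch below (illustrative only) computes the four path probabilities of the tree by brute force and the same total through the trellis, using the transition and lexical tables given above; the two results agree.

  trans = {('^', 'N'): 0.7, ('^', 'V'): 0.3,
           ('N', 'N'): 0.2, ('N', 'V'): 0.6, ('N', '.'): 0.2,
           ('V', 'N'): 0.6, ('V', 'V'): 0.2, ('V', '.'): 0.2}
  lex = {('^', 'ε'): 1.0, ('N', 'People'): 0.8, ('N', 'Laugh'): 0.2,
         ('V', 'People'): 0.1, ('V', 'Laugh'): 0.9}
  obs = ['ε', 'People', 'Laugh']

  # brute force: sum over the four tag paths ^ {N,V} {N,V} .
  def path_prob(tags):
      p, seq = 1.0, ['^'] + list(tags) + ['.']
      for k in range(len(obs)):
          p *= lex[(seq[k], obs[k])] * trans[(seq[k], seq[k + 1])]
      return p

  total_tree = sum(path_prob(t) for t in [('N', 'N'), ('N', 'V'), ('V', 'N'), ('V', 'V')])

  # trellis: accumulate F column by column
  F1 = {q: 1.0 * lex[('^', 'ε')] * trans[('^', q)] for q in 'NV'}
  F2 = {q: sum(F1[p] * lex[(p, 'People')] * trans[(p, q)] for p in 'NV') for q in 'NV'}
  F_end = sum(F2[p] * lex[(p, 'Laugh')] * trans[(p, '.')] for p in 'NV')

  print(total_tree, F_end)   # both are 0.06676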

Number of Multiplications

Tree: each path has 5 multiplications (+ 1 addition to fold it into the total). There are 4 paths in the tree, therefore a total of 20 multiplications and 3 additions.

Trellis: F_N1 -> 1 multiplication; F_V1 -> 1 multiplication; F_N2 = F_N1 x (1 mult) + F_V1 x (1 mult) = 4 multiplications + 1 addition. Similarly for F_V2 and F_., 4 multiplications and 1 addition each. So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = #states and |O| = observation length (not counting the initial ^/ε).

Stage 1 of the trellis: |S| multiplications.
Stage 2 of the trellis: |S| nodes, and each node needs computation over |S| incoming arcs; each arc = 1 multiplication, and accumulating into F = 1 more multiplication; total 2|S|^2 multiplications.
The same holds for each stage before reading '.'; at the final stage ('.') -> 2|S| multiplications.
Therefore, total multiplications = |S| + 2|S|^2 (|O| - 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values multiplied by P(Sp -> Sq).
3. Complexity = |S| + 2|S|^2 (|O| - 1) + 2|S|
              = 2|S|^2 |O| - 2|S|^2 + 3|S|
              = O(|S|^2 |O|)
   i.e. linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward Probability
   a) Derive the Backward Algorithm
   b) Compute its complexity
2. Express P(O) in terms of both the Forward and the Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131



Day wise schedule (34)

Day-5 Lexical resources Before break Wordnet ConceptNet FrameNet VerbNet etc After break Hands-on on Lexical Resources NELL NEIL

Day-6 Information Extraction Text classification and basic search Before break Named Entity Recognition Text Entailment

Lucene Nutch etc After break NER Hands-on basic search Open IE system

21 July 2014Pushpak Bhattacharyya Intro

POS 51

Day wise schedule (44)

Day-7 Affective NLP (cognitive and culture specific NLP) Before break Sentiment Analysis Pragmatics Intent

recognition (Sarcasm Thwarting) Eye-Tracking After break Machine learning techniques with sentiment

analysis as target

Day-8 Deep Learning Before break Word Vectors and embedding Neural Nets

Neural language models After break Discussion on deep learning tool

Day-9 and 10 Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

ldquo_ldquo The_DT mechanisms_NNS that_WDTmake_VBP traditional_JJ hardware_NNare_VBP really_RB being_VBGobsoleted_VBN by_IN microprocessor-based_JJ machines_NNS _ rdquo_rdquo said_VBDMr_NNP Benton_NNP _

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S) . P(O|S) = [P(O0|S0) P(S1|S0)] .
  [P(O1|S1) P(S2|S1)] [P(O2|S2) P(S3|S2)] [P(O3|S3) P(S4|S3)] [P(O4|S4) P(S5|S4)]
  [P(O5|S5) P(S6|S5)] [P(O6|S6) P(S7|S6)] [P(O7|S7) P(S8|S7)] [P(O8|S8) P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e. P(S9|S8) = 1. O0 is the ε-transition.

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:     ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

(Figure: the state chain S0 → S1 → S2 → ... → S8 → S9 drawn as a path, with each arc labelled by the observation emitted on it: ε, R, R, G, G, B, R, G, R.)

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:     ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

Notation: P(Ok|Sk) . P(Sk+1|Sk) = P(Sk --Ok--> Sk+1), the arc from Sk to Sk+1 carrying observation Ok.

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(Figure: two states S1 and S2 with arcs labelled (symbol, probability):
  S1 → S1: (a1, 0.1), (a2, 0.2)
  S1 → S2: (a1, 0.3), (a2, 0.4)
  S2 → S1: (a1, 0.2), (a2, 0.3)
  S2 → S2: (a1, 0.3), (a2, 0.2) )

The question here is: "What is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree

Start --ε--> S1 (1.0), S2 (0.0)

On a1 (arc probabilities 0.1, 0.3 out of S1 and 0.2, 0.3 out of S2):
  1.0 x 0.1 = 0.1,   1.0 x 0.3 = 0.3,   0.0,   0.0

On a2 (arc probabilities 0.2, 0.4 out of S1 and 0.3, 0.2 out of S2):
  0.1 x 0.2 = 0.02,   0.1 x 0.4 = 0.04,   0.3 x 0.3 = 0.09,   0.3 x 0.2 = 0.06

Choose the winning sequence per state, per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd...

From the winners S1 (0.09) and S2 (0.06):

On a1: from S1, 0.09 x 0.1 = 0.009 (to S1) and 0.09 x 0.3 = 0.027 (to S2);
       from S2, 0.06 x 0.2 = 0.012 (to S1) and 0.06 x 0.3 = 0.018 (to S2).

On a2: from S2 (0.027), 0.027 x 0.3 = 0.0081 (to S1) and 0.027 x 0.2 = 0.0054 (to S2);
       from S1 (0.012), 0.012 x 0.2 = 0.0024 (to S1) and 0.012 x 0.4 = 0.0048 (to S2).

The problem being addressed by this tree is S* = argmax_S P(S | a1-a2-a1-a2, μ),
where a1-a2-a1-a2 is the output sequence and μ is the model or the machine.

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

State path: S1 → S2 → S1 → S2 → S1, recovered right to left over the outputs a1 a2 a1 a2.

Problem statement: Find the best possible sequence

S* = argmax_S P(S | O, μ)

where S = state sequence, O = output sequence, and μ = the machine or model,
μ = (S, S0, A, T): state collection, start symbol, alphabet set, transitions.

T is defined as P(Si --ak--> Sj) for all i, j, k.

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

Rows: ending state. Columns: latest symbol observed.

        ε      a1                                a2              a1                a2
  S1   1.0    (1.0x0.1, 0.0x0.2) = (0.1, 0.0)    (0.02, 0.09)    (0.009, 0.012)    (0.0024, 0.0081)
  S2   0.0    (1.0x0.3, 0.0x0.3) = (0.3, 0.0)    (0.04, 0.06)    (0.027, 0.018)    (0.0048, 0.0054)

Note: Every cell records the winning probability ending in that state; the bold-faced (larger) value in each cell is the best sequence probability ending in that state. The final winner is 0.0081; going backward from it (the 2nd element of the tuple indicates that the previous state was S2), we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm (following James Allen, Natural Language Understanding (2nd edition), Benjamin Cummings (pub.), 1995)

Given:
1. The HMM, which means:
   a. Start state S1
   b. Alphabet A = {a1, a2, ..., ap}
   c. Set of states S = {S1, S2, ..., Sn}
   d. Transition probability P(Si --ak--> Sj), which is equal to P(Sj, ak | Si)
2. The output string a1 a2 ... aT

To find: The most likely sequence of states C1, C2, ..., CT which produces the given output sequence, i.e.

C1, C2, ..., CT = argmax_C [ P(C | a1, a2, ..., aT) ]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contd...

Data structures:
1. An N x T array called SEQSCORE to maintain the winner sequence always (N = #states, T = length of o/p sequence)
2. Another N x T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For (i = 2 to N) do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For (t = 2 to T) do
    For (i = 1 to N) do
        SEQSCORE(i,t) = Max(j=1..N) [ SEQSCORE(j,(t-1)) x P(Sj --a_k--> Si) ]
        BACKPTR(i,t) = the index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = the i that maximizes SEQSCORE(i,T)
For i from (T-1) to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1 x T
2. SEQSCORE can be T x 2

Homework: Compare this with A*/Beam Search. Reason for this comparison: both of them work for finding and recovering a sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

(Figure: the trellis starts at S0 and branches on ε to U1, U2, U3 with probabilities 0.5, 0.3 and 0.2 as marked; on reading R each urn is expanded again to U1, U2, U3. The node scores shown in the figure are 0.03, 0.08, 0.15 at the first level, intermediate products 0.06, 0.02, 0.02, 0.18, 0.24, 0.18 on the arcs, and 0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036 at the leaves.)

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of order > 1 (say 2)

The same theory works:

P(S) . P(O|S) = P(O0|S0) P(S1|S0) .
  [P(O1|S1) P(S2|S1,S0)] [P(O2|S2) P(S3|S2,S1)] [P(O3|S3) P(S4|S3,S2)] [P(O4|S4) P(S5|S4,S3)]
  [P(O5|S5) P(S6|S5,S4)] [P(O6|S6) P(S7|S6,S5)] [P(O7|S7) P(S8|S7,S6)] [P(O8|S8) P(S9|S8,S7)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e. P(S9|S8,S7) = 1. O0 is the ε-transition.

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:     ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence? The language modeling problem.

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   • Less probable because of grammatical mistake
3. P("The svn rises in the east")
   • Less probable because of lexical mistake
4. P("The sun rises in the west")
   • Less probable because of semantic mistake

Probabilities are computed in the context of corpora.

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of a language model:

1. Detect well-formedness
   • Lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   • Given a piece of text, what language does it belong to?
     Good morning - English
     Guten Morgen - German
     Bonjour - French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0, o1, o2, o3, ..., om)?

P(O) = Σ_S P(O, S)      (Marginalization)

Consider the observation sequence

  O0   O1   O2   ...   Om
  S0   S1   S2   ...   Sm   Sm+1

where the Si's represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0, o1, o2, o3, ..., om)

P(O, S) = P(S) . P(O | S)
        = P(S0, S1, ..., Sm+1) . P(o0, o1, ..., om | S0, S1, ..., Sm+1)
        = [P(S0) P(S1|S0) P(S2|S1) ... P(Sm+1|Sm)] . [P(o0|S0) P(o1|S1) ... P(om|Sm)]
        = P(S0) [P(o0|S0) P(S1|S0)] [P(o1|S1) P(S2|S1)] ... [P(om|Sm) P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k, i)

Define F(k, i) = probability of being in state Si having seen o0, o1, o2, ..., ok:

F(k, i) = P(o0, o1, o2, ..., ok, Si)

With m as the length of the observed sequence and N states,

P(observed sequence) = P(o0, o1, o2, ..., om)
                     = Σ_{p=0..N} P(o0, o1, o2, ..., om, Sp)
                     = Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd.)

F(k, q) = P(o0, o1, o2, ..., ok, Sq)
        = P(o0, o1, o2, ..., ok-1, ok, Sq)
        = Σ_{p=0..N} P(o0, o1, o2, ..., ok-1, Sp, ok, Sq)
        = Σ_{p=0..N} P(o0, o1, o2, ..., ok-1, Sp) . P(ok, Sq | o0, o1, o2, ..., ok-1, Sp)
        = Σ_{p=0..N} F(k-1, p) . P(ok, Sq | Sp)
        = Σ_{p=0..N} F(k-1, p) . P(Sp --ok--> Sq)

  O0  O1  O2  O3  ...  Ok  Ok+1  ...  Om-1  Om
  S0  S1  S2  S3  ...  Sp  Sq    ...  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(k, i)

Define B(k, i) = probability of seeing ok, ok+1, ok+2, ..., om given that the state was Si:

B(k, i) = P(ok, ok+1, ok+2, ..., om | Si)

With m as the length of the whole observed sequence,

P(observed sequence) = P(o0, o1, o2, ..., om)
                     = P(o0, o1, o2, ..., om | S0)
                     = B(0, 0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd.)

B(k, p) = P(ok, ok+1, ok+2, ..., om | Sp)
        = P(ok+1, ok+2, ..., om, ok | Sp)
        = Σ_{q=0..N} P(ok+1, ok+2, ..., om, ok, Sq | Sp)
        = Σ_{q=0..N} P(ok, Sq | Sp) . P(ok+1, ok+2, ..., om | ok, Sq, Sp)
        = Σ_{q=0..N} P(ok+1, ok+2, ..., om | Sq) . P(ok, Sq | Sp)
        = Σ_{q=0..N} B(k+1, q) . P(Sp --ok--> Sq)

  O0  O1  O2  O3  ...  Ok  Ok+1  ...  Om-1  Om
  S0  S1  S2  S3  ...  Sp  Sq    ...  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of forward probability: to find P(O), the probability of the observation sequence.

E.g. ^ People laugh .

(Trellis: ^ at the start, then {N, V} for "People", {N, V} for "laugh", and . at the end.)

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probabilities:

        ^     N     V     .
  ^     0    0.7   0.3    0
  N     0    0.2   0.6   0.2
  V     0    0.6   0.2   0.2
  .     1     0     0     0

Lexical (emission) probabilities:

        ε    People   Laugh
  ^     1      0        0
  N     0     0.8      0.2
  V     0     0.1      0.9
  .     1      0        0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all state sequences S of  Π_i  P(Oi | Si) . P(Si+1 | Si)

Computation in the various paths of the tree:

          ε    People   Laugh
Path 1:   ^      N        N      P(Path1) = (1.0 x 0.7) x (0.8 x 0.2) x (0.2 x 0.2)
Path 2:   ^      N        V      P(Path2) = (1.0 x 0.7) x (0.8 x 0.6) x (0.9 x 0.2)
Path 3:   ^      V        N      P(Path3) = (1.0 x 0.3) x (0.1 x 0.6) x (0.2 x 0.2)
Path 4:   ^      V        V      P(Path4) = (1.0 x 0.3) x (0.1 x 0.2) x (0.9 x 0.2)

(Tree: ^ branches on ε to N and V over "People"; each of these branches again to N and V over "Laugh".)

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis

F = accumulated F x output probability x transition probability

F_^  = 1.0
F_N1 = F_^ x (0.7 x 1.0)
F_V1 = F_^ x (0.3 x 1.0)
F_N2 = F_N1 x (0.2 x 0.8) + F_V1 x (0.6 x 0.1)
F_V2 = F_N1 x (0.6 x 0.8) + F_V1 x (0.2 x 0.1)
F_.  = F_N2 x (0.2 x 0.2) + F_V2 x (0.2 x 0.9)

(Trellis: ^ --ε--> {N1, V1} over "People" --> {N2, V2} over "laugh" --> '.'; each bracket is transition probability x lexical probability.)

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of Multiplications

Tree: Each path has 5 multiplications + 1 addition. There are 4 paths in the tree, therefore a total of 20 multiplications and 3 additions.

Trellis: F_N1 → 1 multiplication; F_V1 → 1 multiplication; F_N2 = F_N1 x (1 mult) + F_V1 x (1 mult) = 4 multiplications + 1 addition. Similarly for F_V2 and F_.: 4 multiplications and 1 addition each.

So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = #states and |O| = observation length (excluding '^').

Stage 1 of the trellis: |S| multiplications.
Stage 2 of the trellis: |S| nodes; each node needs computation over |S| arcs. Each arc = 1 multiplication; the accumulated F = 1 more multiplication. Total: 2|S|^2 multiplications.
The same holds for each stage before reading '.'. At the final stage ('.') → 2|S| multiplications.

Therefore total multiplications = |S| + 2|S|^2 (|O| - 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values multiplied by P(S_i → S_j).
3. Complexity = |S| + 2|S|^2 (|O| - 1) + 2|S|
              = 2|S|^2 |O| - 2|S|^2 + 3|S|
              = O(|S|^2 |O|)

i.e. linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward probability:
   a) Derive the Backward Algorithm
   b) Compute its complexity
2. Express P(O) in terms of both forward and backward probabilities.

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble: auto-completion of words (human vs. machine)
Humour detection using WordNet (incongruity theory)
Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List

TnT (http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf)
Brill Tagger (http://delivery.acm.org/10.1145/1080000/1075553/p112-brill.pdf)
Hindi POS Tagger built by IIT Bombay (http://www.cse.iitb.ac.in/~pb/papers/ACL-2006-Hindi-POS-Tagging.pdf)
Projection (http://www.dipanjandas.com/files/posInduction.pdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Day wise schedule (4/4)

Day 7: Affective NLP (cognitive and culture-specific NLP)
  Before break: Sentiment Analysis, Pragmatics, Intent recognition (Sarcasm, Thwarting), Eye-Tracking
  After break: Machine learning techniques with sentiment analysis as target

Day 8: Deep Learning
  Before break: Word vectors and embedding, Neural nets, Neural language models
  After break: Discussion on deep learning tools

Day 9 and 10: Projects and quiz

21 July 2014Pushpak Bhattacharyya Intro

POS 52

Summary
• Both Linguistics and Computation are needed
• Linguistics is the eye, Computation the body
• Phenomenon → Formalization → Technique → Experimentation → Evaluation → Hypothesis Testing
• has accorded to NLP the prestige it commands today
• Natural-Science-like approach
• Neither Theory Building nor Data-Driven Pattern Finding can be ignored
21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

(Figure: the NLP Trinity with three axes - Language: Hindi, Marathi, English, French, ...; Problem: Morph Analysis, Part of Speech Tagging, Parsing, Semantics; Algorithm: HMM, MEMM, CRF, ...)

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

"_" The_DT mechanisms_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, "_" said_VBD Mr._NNP Benton_NNP ._.

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increased Complexity Of Processing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate the complexity of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_J batsman_N Aravinda_De_Silva_N is_F a_F man_N of_F few_J words_N was_F very_R much_R evident_J on_F Wednesday_N when_F the_F legendary_J batsman_N ,_F who_F has_V always_R let_V his_N bat_N talk_V ,_F struggled_V to_F answer_V a_F barrage_N of_F questions_N at_F a_F function_N to_F promote_V the_F cricket_N league_N in_F the_F city_N ._F

N (noun), V (verb), J (adjective), R (adverb) and F (other, i.e. function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation:

That_F/N/J ('that' can be a complementizer (can be put under 'F'), demonstrative (can be put under 'J') or pronoun (can be put under 'N'))
former_J; Sri_N/J Lanka_N/J (Sri Lanka together qualify the skipper); skipper_N/V ('skipper' can be a verb too); and_F; ace_J/N ('ace' can be both J and N: "Nadal served an ace"); batsman_N/J ('batsman' can be J as it qualifies Aravinda De Silva); Aravinda_N De_N Silva_N; is_F; a_F; man_N/V ('man' can be a verb too, as in 'man the boat'); of_F; few_J; words_N/V ('words' can be a verb too, as in 'he words his speeches beautifully')

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of "That"

That man is known by the company he keeps. (Demonstrative)
Man that is known by the company he keeps gets a good job. (Pronoun)
That man is known by the company he keeps is a proverb. (Complementation)

Chaotic systems: systems where a small perturbation in input causes a large change in output.

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation (contd.):

was_F very_R much_R evident_J on_F Wednesday_N; when_F/N ('when' can be a relative pronoun (put under 'N'), as in 'I know the time when he comes'); the_F legendary_J batsman_N; who_F/N; has_V always_R let_V his_N bat_N/V talk_V/N; struggle_V/N; answer_V/N; barrage_N/V; question_N/V; function_N/V; promote_V; cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (1/2)

Best tag sequence = T* = argmax P(T|W) = argmax P(T) P(W|T)   (by Bayes' Theorem)

P(T) = P(t0 = ^, t1, t2, ..., tn+1 = .)
     = P(t0) P(t1|t0) P(t2|t1,t0) P(t3|t2,t1,t0) ... P(tn|tn-1,tn-2,...,t0) P(tn+1|tn,tn-1,...,t0)
     = P(t0) P(t1|t0) P(t2|t1) ... P(tn|tn-1) P(tn+1|tn)
     = Π_{i=0}^{N+1} P(ti | ti-1)      (Bigram Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (2/2)

P(W|T) = P(w0|t0-tn+1) P(w1|w0,t0-tn+1) P(w2|w1,w0,t0-tn+1) ... P(wn|w0-wn-1,t0-tn+1) P(wn+1|w0-wn,t0-tn+1)

Assumption: A word is determined completely by its tag. This is inspired by speech recognition.

       = P(w0|t0) P(w1|t1) ... P(wn+1|tn+1)
       = Π_{i=0}^{n+1} P(wi | ti)
       = Π_{i=1}^{n+1} P(wi | ti)      (Lexical Probability Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R ._.

(Figure: a chain of tag states generating the words, with alternative tags (N/V, V/N, A/N) shown at each position; tag-to-tag arcs carry bigram probabilities and tag-to-word arcs carry lexical probabilities.)

This model is called a generative model. Here words are observed from tags as states. This is similar to an HMM.

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi - Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

(Chart: "Per POS Accuracy for Bigram Assumption" - per-tag accuracy on the y-axis from 0 to 1.2, with one bar per tag: AJ0, AJ0-NN1, AJ0-VVG, AJC, AT0, AV0-AJ0, AVP-PRP, AVQ-CJS, CJS, CJS-PRP, CJT-DT0, CRD-PNI, DT0, DTQ, ITJ, NN1, NN1-NP0, NN1-VVG, NN2-VVZ, NP0-NN1, PNI, PNP, PNX, PRP, PRP-CJS, TO0, VBB, VBG, VBN, VDB, VDG, VDN, VHB, VHG, VHN, VM0, VVB-NN1, VVD-AJ0, VVG, VVG-NN1, VVN, VVN-VVD, VVZ-NN2.)

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of a typical Confusion Matrix

            AJ0  AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN   AJC   AJS   AT0    AV0   AV0-AJ0   AVP
AJ0        2899     20       32        1        3        3      0     0     18     35      27      1
AJ0-AV0      31     18        2        0        0        0      0     0      0      1      15      0
AJ0-NN1     161      0      116        0        0        0      0     0      0      0       1      0
AJ0-VVD       7      0        0        0        0        0      0     0      0      0       0      0
AJ0-VVG       8      0        0        0        2        0      0     0      1      0       0      0
AJ0-VVN       8      0        0        3        0        2      0     0      1      0       0      0
AJC           2      0        0        0        0        0     69     0      0     11       0      0
AJS           6      0        0        0        0        0      0    38      0      2       0      0
AT0         192      0        0        0        0        0      0     0   7000     13       0      0
AV0         120      8        2        0        0        0     15     2     24   2444      29     11
AV0-AJ0      10      7        0        0        0        0      0     0      0     16      33      0
AVP          24      0        0        0        0        0      0     0      1     11       0    737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

(The NLP Trinity figure again - Language: Hindi, Marathi, English, French; Problem: Morph Analysis, Part of Speech Tagging, Parsing, Semantics; Algorithm: HMM, MEMM, CRF - with HMM as the algorithm in focus.)

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example: Colored Ball Choosing

Urn 1: # of Red = 30, # of Green = 50, # of Blue = 20
Urn 2: # of Red = 10, # of Green = 40, # of Blue = 50
Urn 3: # of Red = 60, # of Green = 10, # of Blue = 30

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd.)

Given the transition probability table

        U1   U2   U3
  U1   0.1  0.4  0.5
  U2   0.6  0.2  0.2
  U3   0.3  0.4  0.3

and the emission probability table

        R    G    B
  U1   0.3  0.5  0.2
  U2   0.1  0.4  0.5
  U3   0.6  0.1  0.3

Observation: RRGGBRGR
State Sequence: ?? (not so easily computable)

21 July 2014Pushpak Bhattacharyya Intro

POS 73


Diagrammatic representation (1/2)

(Figure: the three urns U1, U2, U3 drawn as states with transition arcs - U1→U1 0.1, U1→U2 0.4, U1→U3 0.5; U2→U1 0.6, U2→U2 0.2, U2→U3 0.2; U3→U1 0.3, U3→U2 0.4, U3→U3 0.3 - and emission probabilities attached to each state: U1: R 0.3, G 0.5, B 0.2; U2: R 0.1, G 0.4, B 0.5; U3: R 0.6, G 0.1, B 0.3.)

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (2/2)

(Figure: the same machine with emission and transition probabilities multiplied onto the arcs, e.g. R: 0.02, G: 0.08, B: 0.10; R: 0.24, G: 0.04, B: 0.12; R: 0.06, G: 0.24, B: 0.30; R: 0.08, G: 0.20, B: 0.12; R: 0.15, G: 0.25, B: 0.10; R: 0.18, G: 0.03, B: 0.09; R: 0.02, G: 0.08, B: 0.10; R: 0.03, G: 0.05, B: 0.02.)

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM:
1. Given the observation sequence, find the possible state sequences - Viterbi
2. Given the observation sequence, find its probability - forward/backward algorithm
3. Given the observation sequence, find the HMM parameters - Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi

The "start" and "end" are important in a sequence. Subtrees get eliminated due to the Markov Assumption.

POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), . (end) [start & end states]

Lexicon:
  people: N, V
  laugh: N, V

Corpora for Training:
  ^ w11_t11 w12_t12 w13_t13 ... w1k_1_t1k_1 .
  ^ w21_t21 w22_t22 w23_t23 ... w2k_2_t2k_2 .
  ^ wn1_tn1 wn2_tn2 wn3_tn3 ... wnk_n_tnk_n .

Inference

(Partial sequence graph: ^ expands to {N, V} over "people", each of which expands to {N, V} over "laugh", ...)

Transition Probability Table:

        ^     N     V     O     .
  ^     0    0.6   0.2   0.2    0
  N     0    0.1   0.4   0.3   0.2
  V     0    0.3   0.1   0.3   0.3
  O     0    0.3   0.2   0.3   0.2
  .     1     0     0     0     0

This transition table will change from language to language due to language divergences.

Lexical Probability Table (size of this table = #POS tags in the tagset x vocabulary size, where vocabulary size = #unique words in the corpus):

        Є       people      laugh      ...
  ^     1         0            0
  N     0      1x10^-3      1x10^-5
  V     0      1x10^-6      1x10^-3
  O     0         0          1x10^-9
  .     1         0            0

Inference: New Sentence

^ people laugh .

p( ^ N N . | ^ people laugh . ) = [P(N|^) x P(Є|^)] x [P(N|N) x P(people|N)] x [P(.|N) x P(laugh|N)]
                                = (0.6 x 1) x (0.1 x 1x10^-3) x (0.2 x 1x10^-5)

(Partial graph: ^ --Є--> {N, V} over "people" --Є--> {N, V} over "laugh" ...)

Computational Complexity

If we have to get the probability of each sequence and then find the maximum among them, we would run into an exponential number of computations.

If |s| = #states (tags + ^ + .) and |o| = length of the sentence (#words + ^ + .), then #sequences = |s|^(|o|-2).

But a large number of partial computations can be reused using Dynamic Programming.

Dynamic Programming

(Tree for "^ people laugh .": on Є the start node ^ expands to N1, V2, O3; each of these expands again over "people" to N, V, O children - the N-children of the V and O subtrees are labelled N4 and N5 in the figure.)

From ^ on Є: N1 gets 0.6 x 1.0 = 0.6; V2 and O3 get 0.2 each.

From N1 over "people" (lexical probability 1x10^-3):
  1. to N: 0.6 x 0.1 x 10^-3 = 6 x 10^-5
  2. to V: 0.6 x 0.4 x 10^-3 = 2.4 x 10^-4
  3. to O: 0.6 x 0.3 x 10^-3 = 1.8 x 10^-4
     to .: 0.6 x 0.2 x 10^-3 = 1.2 x 10^-4

No need to expand N4 and N5 because they will never be a part of the winning sequence.

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.

Now the complexity reduces from |s|^|o| to |s| x |o|.

Here we followed the Markov assumption of order 1.

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state. Keep advancing sequences that are "maximum" amongst all those ending in the same state.

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Summarybull Both Linguistics and Computation needed

bull Linguistics is the eye Computation the body

bull PhenomenonFomalizationTechniqueExperimentationEvaluationHypothesis Testing

bull has accorded to NLP the prestige it commands today

bull Natural Science like approach

bull Neither Theory Building nor Data Driven Pattern finding can be ignored21 July 2014

Pushpak Bhattacharyya Intro POS 53

Part of Speech Tagging

With Hidden Markov Model

21 July 2014Pushpak Bhattacharyya Intro

POS 54

NLP Trinity

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 55

Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

ldquo_ldquo The_DT mechanisms_NNS that_WDTmake_VBP traditional_JJ hardware_NNare_VBP really_RB being_VBGobsoleted_VBN by_IN microprocessor-based_JJ machines_NNS _ rdquo_rdquo said_VBDMr_NNP Benton_NNP _

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of Multiplications

Tree: each path has 5 multiplications + 1 addition; there are 4 paths in the tree. Therefore a total of 20 multiplications and 3 additions.

Trellis:
  F(N, after ε) → 1 multiplication
  F(V, after ε) → 1 multiplication
  F(N, after People) = F(N, after ε) × (1 mult) + F(V, after ε) × (1 mult) = 4 multiplications + 1 addition
  Similarly for F(V, after People) and F($): 4 multiplications and 1 addition each.

So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = number of states and |O| = observation length, excluding ^ and $.

Stage 1 of the trellis: |S| multiplications.
Stage 2 of the trellis: |S| nodes, each needing computation over |S| arcs. Each arc = 1 multiplication; accumulating F = 1 more multiplication. Total: 2|S|² multiplications.
The same holds for each stage before reading '$'. At the final stage ('$'): 2|S| multiplications.

Therefore, total multiplications = |S| + 2|S|²(|O| − 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127
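A quick sanity check against the toy example (my arithmetic): with |S| = 2 tag states (N, V) and |O| = 2 words between ^ and $, the formula gives |S| + 2|S|²(|O| − 1) + 2|S| = 2 + 8 + 4 = 14, matching the 14 multiplications counted on the previous slide.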

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values multiplied by P(Sp → Sq).
3. Complexity = |S| + 2|S|²(|O| − 1) + 2|S|
             = 2|S|²|O| − 2|S|² + 3|S|
             = O(|S|² |O|)

i.e., linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward Probability
   a) Derive the Backward Algorithm
   b) Compute its complexity

2. Express P(O) in terms of both the Forward and the Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131



Part of Speech Tagging

POS Tagging attaches to each word in a sentence a part of speech tag from a given set of tags called the Tag-Set

Standard Tag-set Penn Treebank (for English)

21 July 2014Pushpak Bhattacharyya Intro

POS 56

Example

ldquo_ldquo The_DT mechanisms_NNS that_WDTmake_VBP traditional_JJ hardware_NNare_VBP really_RB being_VBGobsoleted_VBN by_IN microprocessor-based_JJ machines_NNS _ rdquo_rdquo said_VBDMr_NNP Benton_NNP _

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state. Keep advancing sequences that are "maximum" amongst all those ending in the same state.

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm

(Tree for the sentence "^ People laugh .": from ^, the Є step gives N (0.6), V (0.2), O (0.2). On "People", the N branch expands to N, V, O with scores 0.06x10^-3, 0.24x10^-3, 0.18x10^-3; the V branch to 0.06x10^-6, 0.02x10^-6, 0.06x10^-6; the O branch to (0), (0), (0).)

Claim: We do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass
Will a word always be given the same tag? No. Consider the example:

(1) ^ people the city with soldiers . (i.e. "populate")
(2) ^ quickly people the city .

In the first sentence "people" is most likely to be tagged as a noun, whereas in the second the probability mass will shift and "people" will be tagged as a verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and language phenomenon

Long tail phenomenon: the probability is very low, but not zero, over a large observed sequence.

Language phenomenon: "people", which is predominantly tagged as noun, displays a long tail phenomenon; "laugh" is predominantly tagged as verb.

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

(Two nodes, N1 with score 6x10^-5 and N2 with score 6x10^-8, each expanding into N, V, O children on "LAUGH".)

In the next step all the probabilities will be multiplied by identical probabilities (lexical and transition), so the children of N2 will always have lower probability than the children of N1.

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B) = P(B|A) if P(A) = P(B).

P(A|B) can mean:
Causality: B causes A
Sequentiality: A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here:
S = {U1, U2, U3} (states), V = {R, G, B} (observation symbols)
For observation sequence O = o1 ... on and state sequence Q = q1 ... qn:

A (transition probabilities) =
        U1    U2    U3
U1     0.1   0.4   0.5
U2     0.6   0.2   0.2
U3     0.3   0.4   0.3

B (emission probabilities) =
        R     G     B
U1     0.3   0.5   0.2
U2     0.1   0.4   0.5
U3     0.6   0.1   0.3

π (initial probabilities): π_i = P(q_1 = U_i)

92

Observations and states
        O1  O2  O3  O4  O5  O6  O7  O8
OBS:     R   R   G   G   B   R   G   R
State:  S1  S2  S3  S4  S5  S6  S7  S8

Si = U1, U2 or U3; a particular state
S: state sequence
O: observation sequence
S* = "best" possible state (urn) sequence
Goal: Maximize P(S*|O) by choosing the "best" S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O), where S is the state sequence and O is the observation sequence.

S* = argmax_S ( P(S|O) )

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

P(S|O) = P(S1-8 | O1-8)
       = P(S1|O) P(S2|S1, O) P(S3|S1-2, O) ... P(S8|S1-7, O)

By Markov Assumption (a state depends only on the previous state):

P(S|O) = P(S1|O) P(S2|S1, O) P(S3|S2, O) ... P(S8|S7, O)

        O1  O2  O3  O4  O5  O6  O7  O8
OBS:     R   R   G   G   B   R   G   R
State:  S1  S2  S3  S4  S5  S6  S7  S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) = P(A) P(B|A) / P(B)

P(A): Prior
P(B|A): Likelihood

argmax_S P(S|O) = argmax_S P(S) P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transition Probability

P(S) = P(S1-8)
     = P(S1) P(S2|S1) P(S3|S1-2) P(S4|S1-3) ... P(S8|S1-7)

By Markov Assumption (k=1):

P(S) = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) ... P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence Probability

P(O|S) = P(O1|S1-8) P(O2|O1, S1-8) P(O3|O1-2, S1-8) ... P(O8|O1-7, S1-8)

Assumption: the ball drawn depends only on the urn chosen:

P(O|S) = P(O1|S1) P(O2|S2) P(O3|S3) ... P(O8|S8)

So P(S|O) is maximized by maximizing
P(S) P(O|S) = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) ... P(S8|S7) x P(O1|S1) P(O2|S2) P(O3|S3) ... P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S) P(O|S)
= [P(O0|S0) P(S1|S0)]
  [P(O1|S1) P(S2|S1)] [P(O2|S2) P(S3|S2)] [P(O3|S3) P(S4|S3)] [P(O4|S4) P(S5|S4)]
  [P(O5|S5) P(S6|S5)] [P(O6|S6) P(S7|S6)] [P(O7|S7) P(S8|S7)] [P(O8|S8) P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e. P(S9|S8) = 1.
O0 is the ε-transition.

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:     ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99
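For illustration (not from the slides), the grouped product can be evaluated directly for any candidate urn sequence. The sketch below reuses the A, B and assumed pi dictionaries from the earlier urn sketch, treats the S0 step as choosing the first urn with probability pi, and takes P(S9|S8) = 1.

    # Hedged sketch: P(S)P(O|S) as the grouped product of [emission x next-transition]
    # terms, for one candidate urn sequence.
    obs = ["R", "R", "G", "G", "B", "R", "G", "R"]

    def joint(states_seq, obs_seq):
        """[P(O0|S0)P(S1|S0)] * prod_k [P(Ok|Sk) P(S_{k+1}|S_k)], with
        P(O0|S0) = 1 (epsilon), P(S1|S0) = pi[S1] (assumed), P(S9|S8) = 1."""
        p = 1.0 * pi[states_seq[0]]                       # [P(O0|S0) P(S1|S0)]
        for k, s in enumerate(states_seq):
            emit = B[s][obs_seq[k]]
            trans = A[s][states_seq[k + 1]] if k + 1 < len(states_seq) else 1.0
            p *= emit * trans                             # [P(Ok|Sk) P(S_{k+1}|S_k)]
        return p

    print(joint(["U3", "U3", "U1", "U1", "U2", "U3", "U1", "U3"], obs))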

Introducing useful notation

(Trellis: states S0, S1, ..., S8, S9 in sequence, with the observations ε, R, R, G, G, B, R, G, R on the arcs between consecutive states.)

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:     ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

Notation: P(Ok|Sk) P(Sk+1|Sk) = P(Sk --Ok--> Sk+1), i.e. the probability of moving from Sk to Sk+1 while emitting Ok.

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(Figure: a probabilistic FSM with two states S1 and S2; the arcs carry (symbol, probability) labels (a1, 0.3), (a2, 0.4), (a1, 0.2), (a2, 0.3), (a1, 0.1), (a2, 0.2), (a1, 0.3), (a2, 0.2).)

The question here is: "what is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree

Start (ε): S1 = 1.0, S2 = 0.0

Reading a1 (arc probabilities 0.1, 0.3 from S1 and 0.2, 0.3 from S2):
1.0 x 0.1 = 0.1,  1.0 x 0.3 = 0.3,  0.0,  0.0

Reading a2 (arc probabilities 0.2, 0.4 from S1 and 0.3, 0.2 from S2):
0.1 x 0.2 = 0.02,  0.1 x 0.4 = 0.04,  0.3 x 0.3 = 0.09,  0.3 x 0.2 = 0.06

Choose the winning sequence per state, per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure (contd.)

Winners from the previous stage: S1 = 0.09, S2 = 0.06

Reading a1 (arc probabilities 0.1, 0.3 from S1 and 0.2, 0.3 from S2):
0.09 x 0.1 = 0.009,  0.09 x 0.3 = 0.027,  0.06 x 0.2 = 0.012,  0.06 x 0.3 = 0.018

Reading a2 from the winners 0.012 (ending in S1) and 0.027 (ending in S2):
to S1: 0.027 x 0.3 = 0.0081,  0.012 x 0.2 = 0.0024
to S2: 0.012 x 0.4 = 0.0048,  0.027 x 0.2 = 0.0054

The problem being addressed by this tree is S* = argmax_S P(S | a1-a2-a1-a2, µ), where a1-a2-a1-a2 is the output sequence and µ the model or the machine.

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward): S1 → S2 → S1 → S2 → S1, emitting a1, a2, a1, a2.

Problem statement: find the best possible sequence S* = argmax_S P(S | O, µ),
where S = state sequence, O = output sequence, and µ = machine or model.

µ = (S, A, T, S0): state collection S, alphabet set A, transitions T, start symbol/state S0.

T is defined as P(S_i --a_k--> S_j), the probability of moving from S_i to S_j while emitting a_k.

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree (rows: ending state; columns: latest symbol observed)

        ε      a1                               a2             a1               a2
S1     1.0    (1.0x0.1, 0.0x0.2) = (0.1, 0.0)   (0.02, 0.09)   (0.009, 0.012)   (0.0024, 0.0081)
S2     0.0    (1.0x0.3, 0.0x0.3) = (0.3, 0.0)   (0.04, 0.06)   (0.027, 0.018)   (0.0048, 0.0054)

Note: every cell records the winning probability ending in that state. The bold-faced value in each cell shows the sequence probability ending in that state. Going backward from the final winner (which comes from state S2, as indicated by the 2nd element of its tuple), we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105
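The table above can be reproduced mechanically. The following sketch (not from the lecture) encodes the two-state probabilistic FSM, with the assignment of the (symbol, probability) labels to specific arcs read off the tree slides (that assignment is an assumption), and runs Viterbi over the output sequence a1 a2 a1 a2.

    # Two-state probabilistic FSM: arcs[src][symbol][dst] = probability (assumed layout).
    arcs = {"S1": {"a1": {"S1": 0.1, "S2": 0.3}, "a2": {"S1": 0.2, "S2": 0.4}},
            "S2": {"a1": {"S1": 0.2, "S2": 0.3}, "a2": {"S1": 0.3, "S2": 0.2}}}
    output = ["a1", "a2", "a1", "a2"]

    def viterbi_fsm(output, start="S1"):
        score = {s: (1.0 if s == start else 0.0) for s in arcs}   # after the ε start
        back = []
        for sym in output:
            new, ptr = {}, {}
            for dst in arcs:
                cands = [(score[src] * arcs[src][sym].get(dst, 0.0), src) for src in arcs]
                new[dst], ptr[dst] = max(cands)
            score, back = new, back + [ptr]
        # recover the best path by walking the back-pointers from the best end state
        state = max(score, key=score.get)
        path = [state]
        for ptr in reversed(back):
            state = ptr[state]
            path.append(state)
        return list(reversed(path)), max(score.values())

    print(viterbi_fsm(output))   # (['S1', 'S2', 'S1', 'S2', 'S1'], 0.0081)

The printed path S1 S2 S1 S2 S1 with probability 0.0081 agrees with the "Path found (working backward)" slide above.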

Algorithm (following James Allen, Natural Language Understanding (2nd edition), Benjamin/Cummings (pub.), 1995)

Given:
1. The HMM, which means:
   a. Start State: S1
   b. Alphabet: A = {a1, a2, ..., ap}
   c. Set of States: S = {S1, S2, ..., Sn}
   d. Transition probability P(S_i --a_k--> S_j), which is equal to P(S_j, a_k | S_i)
2. The output string: a1 a2 ... aT

To find: the most likely sequence of states C1 C2 ... CT which produces the given output sequence, i.e.,
C1 C2 ... CT = argmax_C [ P(C | a1, a2, ..., aT) ]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structures:
1. An N x T array called SEQSCORE to maintain the winner sequence always (N = #states, T = length of the output sequence)
2. Another N x T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization
SEQSCORE(1, 1) = 1.0
BACKPTR(1, 1) = 0
For i = 2 to N do
    SEQSCORE(i, 1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration
For t = 2 to T do
    For i = 1 to N do
        SEQSCORE(i, t) = Max over j = 1..N of [ SEQSCORE(j, (t-1)) x P(S_j --a_k--> S_i) ]
        BACKPTR(i, t) = the index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification
C(T) = the i that maximizes SEQSCORE(i, T)
For i from (T-1) down to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1 x T
2. SEQSCORE can be T x 2

Homework: compare this with A* and Beam Search. Reason for this comparison: both of them work for finding and recovering a sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 109
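A direct transcription of the three steps into code might look like the sketch below. It is illustrative only: arrays are 0-indexed rather than 1-indexed, and the transition probabilities are assumed to be supplied as trans[j][a_k][i] = P(S_j --a_k--> S_i).

    # Viterbi with explicit SEQSCORE / BACKPTR arrays (0-indexed sketch).
    def viterbi(trans, output, n_states, start=0):
        T = len(output) + 1                     # column 0 = start, column t = after output[t-1]
        SEQSCORE = [[0.0] * T for _ in range(n_states)]
        BACKPTR = [[0] * T for _ in range(n_states)]
        SEQSCORE[start][0] = 1.0                # 1. Initialization: machine starts in `start`

        for t in range(1, T):                   # 2. Iteration
            a_k = output[t - 1]
            for i in range(n_states):
                cands = [(SEQSCORE[j][t - 1] * trans[j].get(a_k, {}).get(i, 0.0), j)
                         for j in range(n_states)]
                SEQSCORE[i][t], BACKPTR[i][t] = max(cands)

        C = [0] * T                             # 3. Sequence identification
        C[T - 1] = max(range(n_states), key=lambda i: SEQSCORE[i][T - 1])
        for t in range(T - 2, -1, -1):
            C[t] = BACKPTR[C[t + 1]][t + 1]
        return C, SEQSCORE[C[T - 1]][T - 1]

    # Check on the two-state FSM from the earlier sketch (0 = S1, 1 = S2):
    trans = [{"a1": {0: 0.1, 1: 0.3}, "a2": {0: 0.2, 1: 0.4}},
             {"a1": {0: 0.2, 1: 0.3}, "a2": {0: 0.3, 1: 0.2}}]
    print(viterbi(trans, ["a1", "a2", "a1", "a2"], 2))   # ([0, 1, 0, 1, 0], 0.0081)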

Viterbi Algorithm for the Urn problem (first two symbols)

(Figure: Viterbi trellis for the urn problem over the first two symbols ε and R. From S0 the ε-arcs to U1, U2, U3 carry probabilities 0.5, 0.3, 0.2; after reading R the three nodes carry scores 0.03, 0.08, 0.15; their expansions carry arc products 0.06, 0.02, 0.02, 0.18, 0.24, 0.18 and leaf scores 0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036.)

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of order > 1 (say 2)

The same theory works:
P(S) P(O|S) = P(O0|S0) P(S1|S0)
  [P(O1|S1) P(S2|S1,S0)] [P(O2|S2) P(S3|S2,S1)] [P(O3|S3) P(S4|S3,S2)] [P(O4|S4) P(S5|S4,S3)]
  [P(O5|S5) P(S6|S5,S4)] [P(O6|S6) P(S7|S6,S5)] [P(O7|S7) P(S8|S7,S6)] [P(O8|S8) P(S9|S8,S7)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e. P(S9|S8,S7) = 1.
O0 is the ε-transition.

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:     ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence? The language modeling problem.

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   - Less probable because of grammatical mistake
3. P("The svn rises in the east")
   - Less probable because of lexical mistake
4. P("The sun rises in the west")
   - Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of a language model:
1. Detect well-formedness
   - lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   - Given a piece of text, what language does it belong to?
     Good morning - English
     Guten Morgen - German
     Bonjour - French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0 o1 o2 o3 ... om)?

Marginalization: P(O) = Σ_S P(O, S)

Consider the observation sequence O = o0 o1 o2 ... om, aligned with a state sequence S0 S1 S2 S3 ... Sm Sm+1, where the Si's represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

P(O) = Σ_S P(O, S) = Σ_S P(S) P(O|S)

P(S0 S1 ... Sm+1) P(o0 o1 ... om | S0 S1 ... Sm+1)
  = P(S0) P(S1|S0) P(S2|S1) ... P(Sm+1|Sm) x P(o0|S0) P(o1|S1) ... P(om|Sm)
  = [P(o0|S0) P(S1|S0)] [P(o1|S1) P(S2|S1)] ... [P(om|Sm) P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116
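For the small urn model this marginalization can be carried out literally, by summing the joint probability over every possible urn sequence. The sketch below reuses states, A, B, the assumed pi and joint() from the earlier sketches; it is exponential in the sequence length, which is exactly what the forward and backward procedures below avoid.

    # P(O) by explicit marginalization over all |S|^n state sequences (brute force).
    from itertools import product

    def prob_obs_brute(obs_seq):
        return sum(joint(list(seq), obs_seq)
                   for seq in product(states, repeat=len(obs_seq)))

    print(prob_obs_brute(["R", "R", "G"]))   # sums over 3^3 = 27 urn sequences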

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k, i)

Define F(k, i) = probability of being in state Si having seen o0 o1 o2 ... ok:
F(k, i) = P(o0 o1 o2 ... ok, Si)

With m as the length of the observed sequence and N states:
P(observed sequence) = P(o0 o1 o2 ... om)
                     = Σ_{p=0..N} P(o0 o1 o2 ... om, Sp)
                     = Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd.)

F(k, q) = P(o0 o1 o2 ... ok, Sq)
        = P(o0 o1 o2 ... ok-1, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 ... ok-1, Sp, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 ... ok-1, Sp) P(ok, Sq | o0 o1 o2 ... ok-1, Sp)
        = Σ_{p=0..N} F(k-1, p) P(ok, Sq | Sp)
        = Σ_{p=0..N} F(k-1, p) P(Sp --ok--> Sq)

O0  O1  O2  O3  ...  Ok  Ok+1  ...  Om-1  Om
S0  S1  S2  S3  ...  Sp  Sq    ...  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119
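The recurrence turns directly into a dynamic program. The sketch below is an illustration only: it uses the state-emission form of the urn model (the ball is drawn from the urn currently occupied, and the first urn is drawn from the assumed pi), which computes the same P(O) as the arc-emission notation used on the slides, and it reuses states, A, B and pi from the earlier sketch.

    # Forward algorithm for the urn model: F[k][q] = P(o0 ... ok, state_k = q).
    def forward(obs_seq):
        F = [{q: pi[q] * B[q][obs_seq[0]] for q in states}]        # base case F(0, q)
        for k in range(1, len(obs_seq)):
            F.append({q: sum(F[k - 1][p] * A[p][q] for p in states) * B[q][obs_seq[k]]
                      for q in states})
        return sum(F[-1].values())                                  # P(O) = sum_q F(m, q)

    obs = ["R", "R", "G", "G", "B", "R", "G", "R"]
    print(forward(obs))   # same value as prob_obs_brute(obs), but in O(|S|^2 |O|) time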

Backward probability B(k, i)

Define B(k, i) = probability of seeing ok ok+1 ok+2 ... om given that the state was Si:
B(k, i) = P(ok ok+1 ok+2 ... om | Si)

With m as the length of the whole observed sequence:
P(observed sequence) = P(o0 o1 o2 ... om)
                     = P(o0 o1 o2 ... om | S0)
                     = B(0, 0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd.)

B(k, p) = P(ok ok+1 ok+2 ... om | Sp)
        = P(ok+1 ok+2 ... om, ok | Sp)
        = Σ_{q=0..N} P(ok+1 ok+2 ... om, ok, Sq | Sp)
        = Σ_{q=0..N} P(ok, Sq | Sp) P(ok+1 ok+2 ... om | ok, Sq, Sp)
        = Σ_{q=0..N} P(ok+1 ok+2 ... om | Sq) P(ok, Sq | Sp)
        = Σ_{q=0..N} B(k+1, q) P(Sp --ok--> Sq)

O0  O1  O2  O3  ...  Ok  Ok+1  ...  Om-1  Om
S0  S1  S2  S3  ...  Sp  Sq    ...  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability: to find P(O), the probability of the observation sequence.

E.g.: ^ People laugh .
(Trellis: start state ^, a column with N and V for "People", a column with N and V for "Laugh", then the end state.)

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probabilities:
        ^     N     V     .
^       0    0.7   0.3    0
N       0    0.2   0.6   0.2
V       0    0.6   0.2   0.2
.       1     0     0     0

Lexical probabilities:
        ε    People   Laugh
^       1      0        0
N       0     0.8      0.2
V       0     0.1      0.9
.       1      0        0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all state sequences S of Π_j P(Oj|Sj) P(Sj --Oj--> Sj+1), i.e. multiply an emission and a transition probability at every step of every path and then sum over all paths.

Computation in the various paths of the tree (ε, People, Laugh):

Path 1: ^ N N    P(Path1) = (1.0 x 0.7) x (0.8 x 0.2) x (0.2 x 0.2)
Path 2: ^ N V    P(Path2) = (1.0 x 0.7) x (0.8 x 0.6) x (0.9 x 0.2)
Path 3: ^ V N    P(Path3) = (1.0 x 0.3) x (0.1 x 0.6) x (0.2 x 0.2)
Path 4: ^ V V    P(Path4) = (1.0 x 0.3) x (0.1 x 0.2) x (0.9 x 0.2)

(Tree: from ^ via ε to N and V, each of which branches to N and V on "People", then on "Laugh".)

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis
F = accumulated F x output probability x transition probability

F = 0.7 x 1.0
F = 0.3 x 1.0
F = F x (0.2 x 0.3) + F x (0.6 x 0.1)
F = F x (0.6 x 0.8) + F x (0.2 x 0.1)
F = F x (0.2 x 0.2) + F x (0.2 x 0.9)

(Trellis: ^, then N and V for "People", N and V for "Laugh", with ε on the first arc.)

21 July 2014Pushpak Bhattacharyya Intro

POS 125
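To see the saving concretely, the sketch below (illustration only; the transition and lexical values are assumed exactly as printed in the tables above) computes P(O) for "^ People Laugh ." twice: once by enumerating the four paths, and once by accumulating F over the trellis. Both give the same number.

    # Transition P(next_tag | tag) and lexical P(word | tag) tables from the slides above.
    tr = {"^": {"N": 0.7, "V": 0.3},
          "N": {"N": 0.2, "V": 0.6, ".": 0.2},
          "V": {"N": 0.6, "V": 0.2, ".": 0.2}}
    lx = {"^": {"eps": 1.0}, "N": {"People": 0.8, "Laugh": 0.2}, "V": {"People": 0.1, "Laugh": 0.9}}

    words = ["People", "Laugh"]

    # 1) Enumerate the four paths ^ -> t1 -> t2 -> .
    paths_total = 0.0
    for t1 in ("N", "V"):
        for t2 in ("N", "V"):
            p = (lx["^"]["eps"] * tr["^"][t1]) \
                * (lx[t1][words[0]] * tr[t1][t2]) \
                * (lx[t2][words[1]] * tr[t2]["."])
            paths_total += p

    # 2) Accumulate forward values over the trellis.
    F = {t: lx["^"]["eps"] * tr["^"][t] for t in ("N", "V")}          # after the eps arc
    F = {t: sum(F[s] * lx[s][words[0]] * tr[s][t] for s in ("N", "V")) for t in ("N", "V")}
    trellis_total = sum(F[s] * lx[s][words[1]] * tr[s]["."] for s in ("N", "V"))

    print(paths_total, trellis_total)   # both 0.06676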

Number of Multiplications

Tree: each path has 5 multiplications + 1 addition. There are 4 paths in the tree, therefore a total of 20 multiplications and 3 additions.

Trellis: the first two F values take 1 multiplication each; each later F = F x (...) + F x (...) takes 4 multiplications and 1 addition, and there are three such F values. So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = #states and |O| = observation length (including '^' and '.').

Stage 1 of the trellis: |S| multiplications.
Stage 2 of the trellis: |S| nodes, each needing computation over |S| arcs; each arc = 1 multiplication, and the accumulated F = 1 more multiplication. Total: 2|S|^2 multiplications.
The same holds for each stage before reading '.'. At the final stage ('.'): 2|S| multiplications.

Therefore total multiplications = |S| + 2|S|^2 (|O| - 1) + 2|S|.

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values multiplied by P(S_i → S_j).
3. Complexity = |S| + 2|S|^2 (|O| - 1) + 2|S|
              = 2|S|^2 |O| - 2|S|^2 + 3|S|
              = O(|S|^2 |O|)
i.e. linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List
TnT (http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf)

Brill Tagger (http://delivery.acm.org/10.1145/1080000/1075553/p112-brill.pdf)

Hindi POS Tagger built by IIT Bombay (http://www.cse.iitb.ac.in/~pb/papers/ACL-2006-Hindi-POS-Tagging.pdf)

Projection (http://www.dipanjandas.com/files/posInduction.pdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Example

"_" The_DT mechanisms_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, "_" said_VBD Mr._NNP Benton_NNP ._.

21 July 2014Pushpak Bhattacharyya Intro

POS 57

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Coreference

Increased complexity of processing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_J batsman_N Aravinda_De_Silva_N is_F a_F man_N of_F few_J words_N was_F very_R much_R evident_J on_F Wednesday_N when_F the_F legendary_J batsman_N ,_F who_F has_V always_R let_V his_N bat_N talk_V ,_F struggled_V to_F answer_V a_F barrage_N of_F questions_N at_F a_F function_N to_F promote_V the_F cricket_N league_N in_F the_F city_N ._F

N (noun), V (verb), J (adjective), R (adverb) and F (other, i.e. function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation:
That_F|N|J ('that' can be a complementizer (can be put under 'F'), a demonstrative (can be put under 'J'), or a pronoun (can be put under 'N'))
former_J
Sri_N|J Lanka_N|J (Sri Lanka together qualify the skipper)
skipper_N|V ('skipper' can be a verb too)
and_F
ace_J|N ('ace' can be both J and N: "Nadal served an ace")
batsman_N|J ('batsman' can be J as it qualifies Aravinda De Silva)
Aravinda_N De_N Silva_N is_F a_F
man_N|V ('man' can be a verb too, as in 'man the boat')
of_F few_J
words_N|V ('words' can be a verb too, as in 'he words his speeches beautifully')

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of "That"

That man is known by the company he keeps. (Demonstrative)

Man that is known by the company he keeps gets a good job. (Pronoun)

That man is known by the company he keeps, is a proverb. (Complementation)

Chaotic systems: systems where a small perturbation in input causes a large change in output.

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation (contd.):
was_F very_R much_R evident_J on_F Wednesday_N
when_F|N ('when' can be a relative pronoun (put under 'N'), as in 'I know the time when he comes')
the_F legendary_J batsman_N
who_F|N has_V always_R let_V his_N bat_N|V talk_V|N struggle_V|N answer_V|N barrage_N|V question_N|V function_N|V promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (1/2)

Best tag sequence
= T*
= argmax P(T|W)
= argmax P(T) P(W|T)   (by Bayes' Theorem)

P(T) = P(t0=^ t1 t2 ... tn+1=.)
     = P(t0) P(t1|t0) P(t2|t1,t0) P(t3|t2,t1,t0) ... P(tn|tn-1,tn-2,...,t0) P(tn+1|tn,tn-1,...,t0)
     = P(t0) P(t1|t0) P(t2|t1) ... P(tn|tn-1) P(tn+1|tn)
     = Π_{i=0}^{n+1} P(ti|ti-1)   (Bigram Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (2/2)

P(W|T) = P(w0|t0..tn+1) P(w1|w0,t0..tn+1) P(w2|w1,w0,t0..tn+1) ... P(wn|w0..wn-1,t0..tn+1) P(wn+1|w0..wn,t0..tn+1)

Assumption: a word is determined completely by its tag. This is inspired by speech recognition.

P(W|T) = P(w0|t0) P(w1|t1) ... P(wn+1|tn+1)
       = Π_{i=0}^{n+1} P(wi|ti)
       = Π_{i=1}^{n+1} P(wi|ti)   (Lexical Probability Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R ._.

(Figure: the tag sequence as a chain of states with bigram probabilities between adjacent tags and lexical probabilities from each tag to its word; alternative tags (N, V, A, ...) appear as competing states at each position.)

This model is called a generative model. Here words are observed (generated) from tags as states. This is similar to an HMM.

21 July 2014Pushpak Bhattacharyya Intro

POS 67
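The generative story can be acted out directly: choose the next tag from the bigram table, then emit a word from that tag's lexical distribution. The sketch below is purely illustrative and reuses the hypothetical tr and lx tables from the forward-probability sketch above.

    # Generate (tag, word) pairs by walking the bigram tag model and emitting words.
    import random

    def generate():
        tags, words = [], []
        tag = "^"
        while True:
            nxt_tags = list(tr[tag].keys())
            tag = random.choices(nxt_tags, weights=[tr[tag][t] for t in nxt_tags])[0]
            if tag == ".":
                break
            word_dist = lx[tag]
            word = random.choices(list(word_dist), weights=list(word_dist.values()))[0]
            tags.append(tag)
            words.append(word)
        return tags, words

    print(generate())   # e.g. (['N', 'V'], ['People', 'Laugh'])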

Typical POS tag steps

Implementation of Viterbi - Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68
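The last two steps are easy to script. The sketch below (with made-up gold and predicted tag lists, purely for illustration) builds the kind of confusion matrix and per-POS accuracy figures shown on the next two slides.

    # Confusion matrix and per-POS accuracy from parallel gold / predicted tag lists.
    from collections import Counter, defaultdict

    gold = ["N", "V", "N", "N", "O", "V", "N"]     # hypothetical gold tags
    pred = ["N", "V", "N", "V", "O", "V", "O"]     # hypothetical predicted tags

    confusion = defaultdict(Counter)               # confusion[gold_tag][pred_tag] = count
    for g, p in zip(gold, pred):
        confusion[g][p] += 1

    for g, row in confusion.items():
        total = sum(row.values())
        accuracy = row[g] / total                  # per-POS accuracy = diagonal / row total
        print(g, dict(row), "accuracy = %.2f" % accuracy)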

(Chart: "Per POS Accuracy for Bigram Assumption"; one bar per tag, from AJ0, AJ0-NN1, AJ0-VVG, AJC, AT0, AV0-AJ0, AVP-PRP, AVQ-CJS, CJS, CJS-PRP, CJT-DT0, CRD-PNI, DT0, DTQ, ITJ, NN1, NN1-NP0, NN1-VVG, NN2-VVZ, NP0-NN1, PNI, PNP, PNX, PRP, PRP-CJS, TO0, VBB, VBG, VBN, VDB, VDG, VDN, VHB, VHG, VHN, VM0, VVB-NN1, VVD-AJ0, VVG, VVG-NN1, VVN, VVN-VVD through VVZ-NN2, with accuracy on the y-axis from 0 to 1.2.)

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Where does POS tagging fit in

Morphology

POS tagging

Chunking

Parsing

Semantics Extraction

Discourse and Corefernce

IncreasedComplexity OfProcessing

21 July 2014Pushpak Bhattacharyya Intro

POS58

Example to illustrate complexity of POS taggng

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probabilities P(next tag | current tag):
      ^     N     V     .
^     0    0.7   0.3    0
N     0    0.2   0.6   0.2
V     0    0.6   0.2   0.2
.     1     0     0     0

Lexical (emission) probabilities P(word | tag):
      ε   People  Laugh
^     1     0       0
N     0    0.8     0.2
V     0    0.1     0.9
.     1     0       0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all paths Π P(Oi|Si) · P(Si → Sj)_Oi

Computation in various paths of the Tree

          ε    People  Laugh
Path 1:   ^    N       N
P(Path1) = (1.0 × 0.7) × (0.8 × 0.2) × (0.2 × 0.2)

          ε    People  Laugh
Path 2:   ^    N       V
P(Path2) = (1.0 × 0.7) × (0.8 × 0.6) × (0.9 × 0.2)

          ε    People  Laugh
Path 3:   ^    V       N
P(Path3) = (1.0 × 0.3) × (0.1 × 0.6) × (0.2 × 0.2)

          ε    People  Laugh
Path 4:   ^    V       V
P(Path4) = (1.0 × 0.3) × (0.1 × 0.2) × (0.9 × 0.2)

[Tree diagram: from ^, on ε, branches to N and V on 'People', each branching again to N and V on 'Laugh'.]

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis

F = accumulated F × output probability × transition probability

F_N1 = 0.7 × 1.0
F_V1 = 0.3 × 1.0
F_N2 = F_N1 × (0.2 × 0.8) + F_V1 × (0.6 × 0.1)
F_V2 = F_N1 × (0.6 × 0.8) + F_V1 × (0.2 × 0.1)
F_.  = F_N2 × (0.2 × 0.2) + F_V2 × (0.2 × 0.9)

[Trellis: ^ --ε--> {N1, V1} --People--> {N2, V2} --Laugh--> .]

21 July 2014Pushpak Bhattacharyya Intro

POS 125
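These trellis values can be checked mechanically. A small sketch using the transition and lexical tables given above (the variable names are only illustrative; 'eps' stands for the ε output of the start state):

# transition P(next tag | current tag) and lexical P(word | tag), from the tables above
trans = {'^': {'N': 0.7, 'V': 0.3},
         'N': {'N': 0.2, 'V': 0.6, '.': 0.2},
         'V': {'N': 0.6, 'V': 0.2, '.': 0.2}}
emit = {'^': {'eps': 1.0}, 'N': {'People': 0.8, 'Laugh': 0.2}, 'V': {'People': 0.1, 'Laugh': 0.9}}

F_N1 = trans['^']['N'] * emit['^']['eps']      # 0.7
F_V1 = trans['^']['V'] * emit['^']['eps']      # 0.3
F_N2 = F_N1 * trans['N']['N'] * emit['N']['People'] + F_V1 * trans['V']['N'] * emit['V']['People']
F_V2 = F_N1 * trans['N']['V'] * emit['N']['People'] + F_V1 * trans['V']['V'] * emit['V']['People']
F_end = F_N2 * trans['N']['.'] * emit['N']['Laugh'] + F_V2 * trans['V']['.'] * emit['V']['Laugh']
print(F_end)   # P(O); the same value is obtained by summing the four path probabilities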

Number of Multiplications

Tree: each path has 5 multiplications. There are 4 paths in the tree, combined with 3 additions. Therefore a total of 20 multiplications and 3 additions.

Trellis:
F_N1 -> 1 multiplication
F_V1 -> 1 multiplication
F_N2 = F_N1 × (1 mult) + F_V1 × (1 mult) = 4 multiplications + 1 addition
Similarly for F_V2 and F_.: 4 multiplications and 1 addition each.

So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = # states and |O| = observation length (= |^ … .|).

Stage 1 of the trellis: |S| multiplications.
Stage 2 of the trellis: |S| nodes; each node needs computation over |S| arcs.
  Each arc = 1 multiplication; accumulating into F = 1 more multiplication.
  Total: 2|S|^2 multiplications.
Same for each stage before reading '.'.
At the final stage ('.'): 2|S| multiplications.

Therefore total multiplications = |S| + 2|S|^2 (|O| − 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of F values multiplied by P(Sp → Sfinal).
3. Complexity = |S| + 2|S|^2 (|O| − 1) + 2|S|
             = 2|S|^2 |O| − 2|S|^2 + 3|S|
             = O(|S|^2 |O|)

i.e. linear in the length of input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Example to illustrate complexity of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 59

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_J batsman_N Aravinda_De_Silva_N is_F a_F man_N of_F few_J words_N was_F very_R much_R evident_J on_F Wednesday_N when_F the_F legendary_J batsman_N ,_F who_F has_V always_R let_V his_N bat_N talk_V ,_F struggled_V to_F answer_V a_F barrage_N of_F questions_N at_F a_F function_N to_F promote_V the_F cricket_N league_N in_F the_F city_N ._F

N (noun), V (verb), J (adjective), R (adverb) and F (other, i.e. function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation

That_F/N/J ('that' can be a complementizer (can be put under 'F'), a demonstrative (can be put under 'J') or a pronoun (can be put under 'N'))
former_J Sri_N/J Lanka_N/J ('Sri Lanka' together qualify the skipper)
skipper_N/V ('skipper' can be a verb too)
and_F ace_J/N ('ace' can be both J and N: "Nadal served an ace")
batsman_N/J ('batsman' can be J as it qualifies Aravinda De Silva)
Aravinda_N De_N Silva_N is_F a_F man_N/V ('man' can be a verb too, as in 'man the boat')
of_F few_J words_N/V ('words' can be a verb too, as in 'he words his speeches beautifully')

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation

was_F very_R much_R evident_J on_F Wednesday_N
when_F/N ('when' can be a relative pronoun (put under 'N') as in 'I know the time when he comes')
the_F legendary_J batsman_N who_F/N has_V always_R let_V his_N bat_N/V talk_V/N
struggle_V/N answer_V/N barrage_N/V question_N/V function_N/V promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (1/2)

Best tag sequence
= T* = argmax P(T|W)
= argmax P(T)P(W|T)   (by Bayes' Theorem)

P(T) = P(t0=^, t1, t2, …, tn+1=.)
= P(t0) P(t1|t0) P(t2|t1,t0) P(t3|t2,t1,t0) … P(tn|tn-1,tn-2,…,t0) P(tn+1|tn,tn-1,…,t0)
= P(t0) P(t1|t0) P(t2|t1) … P(tn|tn-1) P(tn+1|tn)
= Π_{i=0..N+1} P(ti|ti-1)   (Bigram Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (2/2)

P(W|T) = P(w0|t0–tn+1) P(w1|w0,t0–tn+1) P(w2|w1,w0,t0–tn+1) … P(wn|w0–wn-1,t0–tn+1) P(wn+1|w0–wn,t0–tn+1)

Assumption: A word is determined completely by its tag. This is inspired by speech recognition.

= P(w0|t0) P(w1|t1) … P(wn+1|tn+1)
= Π_{i=0..n+1} P(wi|ti)
= Π_{i=1..n+1} P(wi|ti)   (Lexical Probability Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 66
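Under these two assumptions, scoring one candidate tag sequence is a product of bigram and lexical probabilities. A sketch in Python (the dictionary names bigram[prev][cur] for P(ti|ti-1) and lexical[tag][word] for P(wi|ti) are illustrative, not from the slides):

def score(tags, words, bigram, lexical):
    # P(T) * P(W|T) under the bigram and lexical probability assumptions;
    # tags[0] = '^' is paired with the sentence-start marker in words[0]
    p = 1.0
    for i in range(1, len(tags)):
        p *= bigram[tags[i - 1]][tags[i]] * lexical[tags[i]][words[i]]
    return p

The best tag sequence T* is the argmax of this score over all candidate tag sequences, which the Viterbi algorithm (later) finds without enumerating them.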

Generative Model

^_^ People_N Jump_V High_R ._.

[Diagram: the tag sequence as states (^, N, V, A/R, .) linked by bigram probabilities, with each word generated from its tag by a lexical probability; competing tags (N/V, A/N) shown as alternative states.]

This model is called a Generative model. Here words are observed from tags as states. This is similar to an HMM.

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi – Unigram,

Bigram

Five Fold Evaluation (sketched below)

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68
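A hedged sketch of the five-fold evaluation step (train and evaluate stand for whatever tagger training and scoring routines are being used; they are placeholders, not part of the slides):

def five_fold(tagged_sentences, train, evaluate):
    # split the tagged corpus into 5 folds; train on 4, test on 1, rotate, and average
    folds = [tagged_sentences[i::5] for i in range(5)]
    scores = []
    for i in range(5):
        held_out = folds[i]
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train(training)
        scores.append(evaluate(model, held_out))
    return sum(scores) / len(scores)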

[Bar chart: Per POS Accuracy for Bigram Assumption — accuracy (y-axis, 0 to 1.2) for each BNC tag (AJ0, AJ0-NN1, AJ0-VVG, AJC, AT0, AV0-AJ0, AVP-PRP, AVQ-CJS, CJS, CJS-PRP, CJT-DT0, CRD-PNI, DT0, DTQ, ITJ, NN1, NN1-NP0, NN1-VVG, NN2-VVZ, NP0-NN1, PNI, PNP, PNX, PRP, PRP-CJS, TO0, VBB, VBG, VBN, VDB, VDG, VDN, VHB, VHG, VHN, VM0, VVB-NN1, VVD-AJ0, VVG, VVG-NN1, VVN, VVN-VVD, VVZ-NN2) on the x-axis.]

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

          AJ0  AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN  AJC  AJS   AT0   AV0  AV0-AJ0  AVP
AJ0      2899       20       32        1        3        3    0    0    18    35       27    1
AJ0-AV0    31       18        2        0        0        0    0    0     0     1       15    0
AJ0-NN1   161        0      116        0        0        0    0    0     0     0        1    0
AJ0-VVD     7        0        0        0        0        0    0    0     0     0        0    0
AJ0-VVG     8        0        0        0        2        0    0    0     1     0        0    0
AJ0-VVN     8        0        0        3        0        2    0    0     1     0        0    0
AJC         2        0        0        0        0        0   69    0     0    11        0    0
AJS         6        0        0        0        0        0    0   38     0     2        0    0
AT0       192        0        0        0        0        0    0    0  7000    13        0    0
AV0       120        8        2        0        0        0   15    2    24  2444       29   11
AV0-AJ0    10        7        0        0        0        0    0    0     0    16       33    0
AVP        24        0        0        0        0        0    0    0     1    11        0  737

21 July 2014Pushpak Bhattacharyya Intro

POS 70
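Per-POS accuracy and the confusion matrix can be computed from parallel gold and predicted tag lists. A small sketch in plain Python (no particular tagset assumed):

from collections import Counter, defaultdict

def confusion_and_per_pos_accuracy(gold, predicted):
    # gold, predicted: parallel lists of tags for the same tokens
    confusion = defaultdict(Counter)
    for g, p in zip(gold, predicted):
        confusion[g][p] += 1
    per_pos_accuracy = {g: row[g] / sum(row.values()) for g, row in confusion.items()}
    return confusion, per_pos_accuracy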

HMM

[NLP Trinity diagram: Problem axis (Morph Analysis, Part of Speech Tagging, Parsing, Semantics), Language axis (Hindi, Marathi, English, French), Algorithm axis (HMM, CRF, MEMM).]

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1: # of Red = 30, # of Green = 50, # of Blue = 20
Urn 2: # of Red = 10, # of Green = 40, # of Blue = 50
Urn 3: # of Red = 60, # of Green = 10, # of Blue = 30

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

Given the transition probability table
        U1   U2   U3
U1     0.1  0.4  0.5
U2     0.6  0.2  0.2
U3     0.3  0.4  0.3

and the emission probability table
        R    G    B
U1     0.3  0.5  0.2
U2     0.1  0.4  0.5
U3     0.6  0.1  0.3

Observation: R R G G B R G R
State Sequence: not so easily computable.

21 July 2014Pushpak Bhattacharyya Intro

POS 73
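In code this HMM is just three arrays. A sketch with numpy, copying the numbers from the two tables above (the uniform initial distribution is an illustrative assumption; the slides introduce π later without giving its values):

import numpy as np

# transition matrix A[i][j] = P(next urn is U(j+1) | current urn is U(i+1))
A = np.array([[0.1, 0.4, 0.5],
              [0.6, 0.2, 0.2],
              [0.3, 0.4, 0.3]])
# emission matrix B[i][k] = P(colour k | urn i), columns ordered R, G, B
B = np.array([[0.3, 0.5, 0.2],
              [0.1, 0.4, 0.5],
              [0.6, 0.1, 0.3]])
# initial distribution pi[i] = P(q1 = U(i+1)) -- assumed uniform here
pi = np.array([1/3, 1/3, 1/3])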

Diagrammatic representation (1/2)

[State diagram: urns U1, U2, U3 as states, with transition probabilities 0.1, 0.4, 0.5 (from U1), 0.6, 0.2, 0.2 (from U2), 0.3, 0.4, 0.3 (from U3) on the arcs, and emission probabilities at the states: U1 (R 0.3, G 0.5, B 0.2), U2 (R 0.1, G 0.4, B 0.5), U3 (R 0.6, G 0.1, B 0.3).]

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (2/2)

[The same state diagram with the emission probabilities folded into the arcs, i.e. each arc carries a combined per-colour probability (values such as R 0.02, G 0.08, B 0.10; R 0.24, G 0.04, B 0.12; R 0.06, G 0.24, B 0.30; R 0.08, G 0.20, B 0.12; R 0.15, G 0.25, B 0.10; R 0.18, G 0.03, B 0.09; R 0.03, G 0.05, B 0.02).]

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM

1. Given the observation sequence, find the possible state sequences – Viterbi
2. Given the observation sequence, find its probability – forward/backward algorithm
3. Given the observation sequence, find the HMM parameters – Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi

The "start" and "end" are important in a sequence. Subtrees get eliminated due to the Markov Assumption.

POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), . (end) [start & end states]

Lexicon:
people: N, V
laugh:  N, V

Corpora for Training:
^ w11_t11 w12_t12 w13_t13 …… w1k_1_t1k_1 .
^ w21_t21 w22_t22 w23_t23 …… w2k_2_t2k_2 .
…
^ wn1_tn1 wn2_tn2 wn3_tn3 …… wnk_n_tnk_n .

Inference

[Partial sequence graph: ^ branching to N and V on 'people', each branching again to N and V on 'laugh'.]

Transition Probability Table:
      ^     N     V     O     .
^     0    0.6   0.2   0.2    0
N     0    0.1   0.4   0.3   0.2
V     0    0.3   0.1   0.3   0.3
O     0    0.3   0.2   0.3   0.2
.     1     0     0     0     0

This transition table will change from language to language due to language divergences.

Lexical Probability Table:
      ε      people     laugh      …
^     1        0           0
N     0     1 × 10^-3   1 × 10^-5
V     0     1 × 10^-6   1 × 10^-3
O     0        0         1 × 10^-9
.     1        0           0

Size of this table = # pos tags in tagset × vocabulary size
(vocabulary size = # unique words in corpus)

Inference: New Sentence

^ people laugh .

p( ^ N N . | ^ people laugh . ) = (0.6 × 1.0) × (0.1 × 1 × 10^-3) × (0.2 × 1 × 10^-5)

[Partial sequence graph for the new sentence, as above.]

Computational Complexity

If we have to get the probability of each sequence and then find the maximum among them, we would run into an exponential number of computations.

If |s| = # states (tags + ^ + .) and |o| = length of sentence (# words + ^ + .), then # sequences = |s|^(|o|−2).

But a large number of partial computations can be reused using Dynamic Programming.

Dynamic Programming

[Search tree: ^ expands on 'people' to N1, V2, O3; each of these expands on 'laugh' to N, V, O nodes (N4 and N5 among them).]

At ^ (on ε): 0.6 × 1.0 = 0.6 towards N; 0.2 towards V; 0.2 towards O; 0 otherwise.

From N1 on 'people':
  0.6 × 0.1 × 10^-3 = 6 × 10^-5     (N→N)
  0.6 × 0.4 × 10^-3 = 2.4 × 10^-4   (N→V)
  0.6 × 0.3 × 10^-3 = 1.8 × 10^-4   (N→O)
  0.6 × 0.2 × 10^-3 = 1.2 × 10^-4   (N→.)

No need to expand N4 and N5 because they will never be a part of the winning sequence.

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.

Now complexity reduces from |s|^|o| to |s|·|o|.

Here we followed the Markov assumption of order 1.

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm

[Tree for the sentence "^ People laugh .": from ^, on ε, the children N, V, O carry (0.6), (0.2), (0.2). On 'People', the N branch (0.6) expands to N, V, O with (0.06 × 10^-3), (0.24 × 10^-3), (0.18 × 10^-3); the V branch (0.2) expands to (0.06 × 10^-6), (0.02 × 10^-6), (0.06 × 10^-6); the O branch (0.2) expands to (0), (0), (0).]

Claim: We do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87
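The same pruned search can be run end to end. A runnable sketch using the transition and lexical tables given above (O is kept even though its lexical probabilities are tiny; the function and variable names are only illustrative):

# transition P(next tag | current tag) and lexical P(word | tag), from the tables above
trans = {'^': {'N': 0.6, 'V': 0.2, 'O': 0.2},
         'N': {'N': 0.1, 'V': 0.4, 'O': 0.3, '.': 0.2},
         'V': {'N': 0.3, 'V': 0.1, 'O': 0.3, '.': 0.3},
         'O': {'N': 0.3, 'V': 0.2, 'O': 0.3, '.': 0.2}}
lex = {'people': {'N': 1e-3, 'V': 1e-6, 'O': 0.0},
       'laugh':  {'N': 1e-5, 'V': 1e-3, 'O': 1e-9}}

def viterbi(words, tags=('N', 'V', 'O')):
    # score[t] = best probability of any tag sequence ending in t; ptrs reconstruct it
    score = {t: trans['^'][t] * lex[words[0]][t] for t in tags}
    ptrs = []
    for w in words[1:]:
        new_score, ptr = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: score[p] * trans[p][t])
            new_score[t] = score[best_prev] * trans[best_prev][t] * lex[w][t]
            ptr[t] = best_prev
        score, ptrs = new_score, ptrs + [ptr]
    # close the sentence with the '.' transition and walk the back-pointers
    last = max(tags, key=lambda t: score[t] * trans[t]['.'])
    seq = [last]
    for ptr in reversed(ptrs):
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))

print(viterbi(['people', 'laugh']))   # -> ['N', 'V'] with these numbers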

Effect of shifting probability mass

Will a word always be given the same tag? No. Consider the examples:

^ people the city with soldiers . (i.e. 'populate')
^ quickly people the city .

In the first sentence "people" is most likely to be tagged as a noun, whereas in the second the probability mass will shift and "people" will be tagged as a verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon

Long tail phenomenon: probability is very low but not zero over a large observed sequence.

Language phenomenon: "people", which is predominantly tagged as "Noun", displays a long tail phenomenon; "laugh" is predominantly tagged as "Verb".

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

[Two N nodes after 'LAUGH': N1 with score 6 × 10^-5 and N2 with score 6 × 10^-8, each expanding to children N, V, O.]

Next step: all the probabilities will be multiplied by identical probabilities (lexical and transition). So the children of N2 will have probability less than the children of N1.

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean?

P(A|B) = P(B|A) if P(A) = P(B)

P(A|B) can mean:
• Causality: B causes A
• Sequentiality: A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here: S = {U1, U2, U3}, V = {R, G, B}

For observation O = o1 … on
and state sequence Q = q1 … qn

π_i = P(q1 = U_i)

A (transition) =
        U1   U2   U3
U1     0.1  0.4  0.5
U2     0.6  0.2  0.2
U3     0.3  0.4  0.3

B (emission) =
        R    G    B
U1     0.3  0.5  0.2
U2     0.1  0.4  0.5
U3     0.6  0.1  0.3

92

Observations and states

        O1 O2 O3 O4 O5 O6 O7 O8
OBS:     R  R  G  G  B  R  G  R
State:  S1 S2 S3 S4 S5 S6 S7 S8

Si ∈ {U1, U2, U3}: a particular state
S: state sequence
O: observation sequence
S*: "best" possible state (urn) sequence
Goal: maximize P(S|O) by choosing the "best" S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

S* = argmax_S P(S|O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

P(S1-8 | O1-8)
= P(S1 | O) P(S2 | S1, O) P(S3 | S2, S1, O) … P(S8 | S7 … S1, O)

By Markov Assumption (a state depends only on the previous state):
= P(S1 | O) P(S2 | S1, O) P(S3 | S2, O) … P(S8 | S7, O)

        O1 O2 O3 O4 O5 O6 O7 O8
OBS:     R  R  G  G  B  R  G  R
State:  S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) = P(A) P(B|A) / P(B)

P(A): Prior
P(B|A): Likelihood

argmax_S P(S|O) = argmax_S P(S) P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

P(S) = P(S1-8)
= P(S1) P(S2|S1) P(S3|S2,S1) P(S4|S3,S2,S1) … P(S8|S7 … S1)

By Markov Assumption (k=1):
P(S) = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) … P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

P(O|S) = P(O1|S1-8) P(O2|O1, S1-8) P(O3|O2, O1, S1-8) … P(O8|O1-7, S1-8)

Assumption: the ball drawn depends only on the Urn chosen.

P(O|S) = P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)

P(S|O) ∝ P(S) P(O|S)
= P(S1) P(S2|S1) P(S3|S2) P(S4|S3) … P(S8|S7) · P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S) = [P(O0|S0) P(S1|S0)]
[P(O1|S1) P(S2|S1)] [P(O2|S2) P(S3|S2)] [P(O3|S3) P(S4|S3)] [P(O4|S4) P(S5|S4)] [P(O5|S5) P(S6|S5)] [P(O6|S6) P(S7|S6)] [P(O7|S7) P(S8|S7)] [P(O8|S8) P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.

After S8 the next state is S9 with probability 1, i.e. P(S9|S8) = 1.

O0 is the ε-transition.

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:    ε   R   R   G   G   B   R   G   R
State: S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

[Chain diagram: S0 --ε--> S1 --R--> S2 --R--> S3 --G--> S4 --G--> S5 --B--> S6 --R--> S7 --G--> S8 --R--> S9]

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:    ε   R   R   G   G   B   R   G   R
State: S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

P(Ok|Sk) P(Sk+1|Sk) = P(Sk → Sk+1)_Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

[Two-state probabilistic FSM over S1 and S2, with arcs labelled (a1, 0.3), (a2, 0.4), (a1, 0.2), (a2, 0.3), (a1, 0.1), (a2, 0.2), (a1, 0.3), (a2, 0.2).]

The question here is: "what is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree

Start (on ε): S1 with probability 1.0, S2 with probability 0.0

Reading a1 (arc probabilities 0.1, 0.3 from S1 and 0.2, 0.3 from S2):
  1.0 × 0.1 = 0.1    0.3    0.0    0.0

Reading a2 (arc probabilities 0.2, 0.4 from S1 and 0.3, 0.2 from S2):
  0.1 × 0.2 = 0.02    0.1 × 0.4 = 0.04    0.3 × 0.3 = 0.09    0.3 × 0.2 = 0.06

Choose the winning sequence per state, per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd…

Reading a1 again (from the stage winners 0.09 ending in S1 and 0.06 ending in S2; arc probabilities 0.1, 0.3 from S1 and 0.2, 0.3 from S2):
  ending in S1: 0.09 × 0.1 = 0.009 and 0.012;  ending in S2: 0.027 and 0.018

Reading a2 (winners 0.012 ending in S1 and 0.027 ending in S2):
  S1 (arc 0.3): 0.0081    S2 (arc 0.2): 0.0054    S2 (arc 0.4): 0.0048    S1 (arc 0.2): 0.0024

The problem being addressed by this tree is S* = argmax_S P(S | a1-a2-a1-a2, μ)

a1-a2-a1-a2 is the output sequence and μ the model or the machine.

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

[Path: S1  S2  S1  S2  S1, reading outputs a1, a2, a1, a2 along the arcs.]

Problem statement: Find the best possible sequence
S* = argmax_S P(S|O, μ)

where S: State Sequence, O: Output Sequence, μ: Machine or Model

Machine or Model = (S, A, T, S0): State collection, Alphabet set, Transitions, Start symbol

T is defined as P(Si --ak--> Sj)

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

Latest symbol observed →
        ε      a1                               a2              a1                 a2
S1     1.0    (1.0×0.1, 0.0×0.2) = (0.1, 0.0)   (0.02, 0.09)    (0.009, 0.012)     (0.0024, 0.0081)
S2     0.0    (1.0×0.3, 0.0×0.3) = (0.3, 0.0)   (0.04, 0.06)    (0.027, 0.018)     (0.0048, 0.0054)
(Rows: ending state.)

Note: every cell records the winning probability ending in that state. The final winner is the largest value in the last column.

The bold-faced values in each cell show the winning sequence probability ending in that state. Going backward from the final winner sequence, which ends in state S2 (indicated by the 2nd tuple), we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm (following James Allen, Natural Language Understanding (2nd edition), Benjamin Cummings (pub.), 1995)

Given:
1. The HMM, which means
   a. Start State S1
   b. Alphabet A = {a1, a2, …, ap}
   c. Set of States S = {S1, S2, …, Sn}
   d. Transition probability P(Si --ak--> Sj), which is equal to P(Sj, ak | Si)
2. The output string a1 a2 … aT

To find: the most likely sequence of states C1 C2 … CT which produces the given output sequence, i.e.,
C1 C2 … CT = argmax_C [P(C | a1, a2, …, aT)]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contd…

Data Structure:
1. An N×T array called SEQSCORE to maintain the winner sequence always (N = # states, T = length of o/p sequence)
2. Another N×T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For (i = 2 to N) do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For (t = 2 to T) do
    For (i = 1 to N) do
        SEQSCORE(i,t) = Max(j=1..N) [SEQSCORE(j,(t-1)) × P(Sj --ak--> Si)]
        BACKPTR(i,t) = index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = i that maximizes SEQSCORE(i,T)
For i from (T-1) to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1×T
2. SEQSCORE can be T×2

Homework: compare this with A*, Beam Search. Reason for this comparison: both of them work for finding and recovering a sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 109
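A direct Python transcription of the three steps, as a sketch (the HMM is assumed to be supplied as a function p_arc(j, a, i) returning P(Sj --a--> Si), with state 1 as the start state; aligning the t-th output symbol with the arc into column t is an assumption mirroring the trellis slides):

def viterbi_allen(output, n_states, p_arc):
    T = len(output)
    # SEQSCORE[i][t] and BACKPTR[i][t], 1-indexed as in the pseudo-code (index 0 unused)
    SEQSCORE = [[0.0] * (T + 1) for _ in range(n_states + 1)]
    BACKPTR = [[0] * (T + 1) for _ in range(n_states + 1)]
    SEQSCORE[1][1] = 1.0                      # initialization: first state is S1
    for t in range(2, T + 1):                 # iteration
        for i in range(1, n_states + 1):
            best_j = max(range(1, n_states + 1),
                         key=lambda j: SEQSCORE[j][t - 1] * p_arc(j, output[t - 1], i))
            SEQSCORE[i][t] = SEQSCORE[best_j][t - 1] * p_arc(best_j, output[t - 1], i)
            BACKPTR[i][t] = best_j
    # sequence identification
    C = [0] * (T + 1)
    C[T] = max(range(1, n_states + 1), key=lambda i: SEQSCORE[i][T])
    for t in range(T - 1, 0, -1):
        C[t] = BACKPTR[C[t + 1]][t + 1]
    return C[1:]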

Viterbi Algorithm for the Urn problem (first two symbols)

[Trellis diagram for the first two symbols (ε, R): from S0 the urns U1, U2, U3 carry 0.5, 0.3, 0.2; the R-arcs contribute products such as 0.03, 0.08, 0.15 and 0.06, 0.02, 0.02, 0.18, 0.24, 0.18, giving candidate scores 0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036 at the next level.]

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_Jbatsman_N Aravinda_De_Silva_N is_F a_F man_N of_Ffew_J words_N was_F very_R much_R evident_J on_FWednesday_N when_F the_F legendary_J batsman_N _F who_F has_V always_R let_V his_N bat_N talk_V _F struggled_V to_F answer_V a_F barrage_N of_Fquestions_N at_F a_F function_N to_F promote_V the_Fcricket_N league_N in_F the_F city_N _F

N (noun) V (verb) J (adjective) R (adverb) and F (other ie function words)

21 July 2014Pushpak Bhattacharyya Intro

POS 60

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

POS disambiguation That_FNJ (lsquothatrsquo can be complementizer (can be put under lsquoFrsquo)

demonstrative (can be put under lsquoJrsquo) or pronoun (can be put under lsquoNrsquo)) former_J Sri_NJ Lanka_NJ (Sri Lanka together qualify the skipper) skipper_NV (lsquoskipperrsquo can be a verb too) and_F ace_JN (lsquoacersquo can be both J and N ldquoNadal served an acerdquo) batsman_NJ (lsquobatsmanrsquo can be J as it qualifies Aravinda De Silva) Aravinda_N De_N Silva_N is_F a_F man_NV (lsquomanrsquo can verb too as inrsquoman the boatrsquo) of_F few_J words_NV (lsquowordsrsquo can be verb too as in lsquohe words is speeches

beautifullyrsquo)

21 July 2014Pushpak Bhattacharyya Intro

POS 61

Behaviour of ldquoThatrdquo That

That man is known by the company he keeps (Demonstrative)

Man that is known by the company he keeps gets a good job (Pronoun)

That man is known by the company he keeps is a proverb (Complementation)

Chaotic systems Systems where a small perturbation in input causes a large change in output

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and states

        O1  O2  O3  O4  O5  O6  O7  O8
OBS      R   R   G   G   B   R   G   R
State   S1  S2  S3  S4  S5  S6  S7  S8

Si = U1/U2/U3, a particular state. S: state sequence; O: observation sequence. S* = "best" possible state (urn) sequence.

Goal: Maximize P(S*|O) by choosing the "best" S.

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O), where S is the state sequence and O is the observation sequence.

S* = argmax_S ( P(S|O) )

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

P(S|O) = P(S1-8 | O1-8)
       = P(S1|O) . P(S2|S1, O) . P(S3|S2 S1, O) ... P(S8|S7 ... S1, O)

By Markov Assumption (a state depends only on the previous state):

P(S|O) = P(S1|O) . P(S2|S1, O) . P(S3|S2, O) ... P(S8|S7, O)

        O1  O2  O3  O4  O5  O6  O7  O8
OBS      R   R   G   G   B   R   G   R
State   S1  S2  S3  S4  S5  S6  S7  S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) = P(A) . P(B|A) / P(B)

P(A) : Prior
P(B|A) : Likelihood

argmax_S P(S|O) = argmax_S P(S) . P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

P(S) = P(S1-8)
     = P(S1) . P(S2|S1) . P(S3|S2 S1) . P(S4|S3 S2 S1) ... P(S8|S7 ... S1)

By Markov Assumption (k=1):

P(S) = P(S1) . P(S2|S1) . P(S3|S2) . P(S4|S3) ... P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

P(O|S) = P(O1|S1-8) . P(O2|O1, S1-8) . P(O3|O2 O1, S1-8) ... P(O8|O7 ... O1, S1-8)

Assumption: the ball drawn depends only on the Urn chosen.

P(O|S) = P(O1|S1) . P(O2|S2) . P(O3|S3) ... P(O8|S8)

P(S|O) ∝ P(S) . P(O|S)
       = P(S1) . P(S2|S1) . P(S3|S2) . P(S4|S3) ... P(S8|S7) .
         P(O1|S1) . P(O2|S2) . P(O3|S3) ... P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S) . P(O|S) = [P(O0|S0) P(S1|S0)]
                [P(O1|S1) P(S2|S1)] [P(O2|S2) P(S3|S2)] [P(O3|S3) P(S4|S3)]
                [P(O4|S4) P(S5|S4)] [P(O5|S5) P(S6|S5)] [P(O6|S6) P(S7|S6)]
                [P(O7|S7) P(S8|S7)] [P(O8|S8) P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.

After S8 the next state is S9 with probability 1, i.e. P(S9|S8) = 1. O0 is the ε-transition.

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs      ε   R   R   G   G   B   R   G   R
State   S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99
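As a small worked instance of this grouping (the initial probability P(S1|S0) is not fixed on these slides, so it is left symbolic): for the first two observations R, R and the candidate states S1 = U3, S2 = U1, S3 = U2, the grouped product contributes [P(ε|S0) P(U3|S0)] . [P(R|U3) P(U1|U3)] . [P(R|U1) P(U2|U1)] = [1 x P(U3|S0)] x [0.6 x 0.3] x [0.3 x 0.4], using the A and B tables above.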

Introducing useful notation

[Figure: the states S0 S1 S2 S3 S4 S5 S6 S7 S8 S9 drawn as a chain, with the observations ε R R G G B R G R written on the arcs between consecutive states.]

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs      ε   R   R   G   G   B   R   G   R
State   S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

Notation: P(Ok|Sk) . P(Sk+1|Sk) = P(Sk --Ok--> Sk+1), i.e. the probability of the arc from Sk to Sk+1 labelled with the output Ok.

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

[Figure: a two-state probabilistic FSM over S1 and S2. Arcs are labelled (symbol, probability): from S1, (a1, 0.1) and (a2, 0.2) loop back to S1 while (a1, 0.3) and (a2, 0.4) go to S2; from S2, (a1, 0.2) and (a2, 0.3) go to S1 while (a1, 0.3) and (a2, 0.2) loop back to S2.]

The question here is: "what is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree

[Figure: tree rooted at Start. On ε the scores are S1 = 1.0, S2 = 0.0. Reading a1, the arcs (0.1, 0.3 out of S1; 0.2, 0.3 out of S2) give 1.0 x 0.1 = 0.1 and 0.3 for S1's children, and 0.0, 0.0 for S2's children. Reading a2 (arcs 0.2, 0.4, 0.3, 0.2), the next level gives 0.1 x 0.2 = 0.02, 0.1 x 0.4 = 0.04, 0.3 x 0.3 = 0.09, 0.3 x 0.2 = 0.06.]

Choose the winning sequence per state per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd...

[Figure: continuing the tree with the per-state winners S1 = 0.09 and S2 = 0.06. Reading a1: from the 0.09 node, 0.09 x 0.1 = 0.009 and 0.09 x 0.3 = 0.027; from the 0.06 node, 0.06 x 0.2 = 0.012 and 0.06 x 0.3 = 0.018. Reading a2: from the 0.027 node, arcs 0.3 and 0.2 give 0.0081 (to S1) and 0.0054 (to S2); from the 0.012 node, arcs 0.4 and 0.2 give 0.0048 (to S2) and 0.0024 (to S1).]

The problem being addressed by this tree is: S* = argmax_S P(S | a1-a2-a1-a2, μ)

where a1-a2-a1-a2 is the output sequence and μ the model or the machine.

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward):

S1 --a1--> S2 --a2--> S1 --a1--> S2 --a2--> S1

Problem statement: Find the best possible sequence

S* = argmax_S P(S|O, μ)

where S = state sequence, O = output sequence, and μ = machine (or model).

μ = (S0, S, A, T): start symbol, state collection, alphabet set, transitions.

T is defined as P(Si --a_k--> Sj), the probability of moving from Si to Sj while emitting a_k.

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

Latest symbol observed →
Ending state ↓     ε      a1                               a2               a1                 a2
S1                1.0    (1.0*0.1, 0.0*0.2) = (0.1, 0.0)   (0.02, 0.09)     (0.009, 0.012)     (0.0024, 0.0081)
S2                0.0    (1.0*0.3, 0.0*0.3) = (0.3, 0.0)   (0.04, 0.06)     (0.027, 0.018)     (0.0048, 0.0054)

Note: Every cell records the winning probability ending in that state; the larger (bold-faced) value in each cell is the winning sequence probability ending there. The final winner is 0.0081, which ends in S1 and was reached from S2 (indicated by the 2nd element of the tuple); going backward from it we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm (following James Allen, Natural Language Understanding (2nd edition), Benjamin Cummings (pub.), 1995)

Given:
1. The HMM, which means:
   a. Start State: S1
   b. Alphabet: A = {a1, a2, ..., ap}
   c. Set of States: S = {S1, S2, ..., Sn}
   d. Transition probability P(Si --a_k--> Sj), which is equal to P(Sj | a_k, Si)
2. The output string a1 a2 ... aT

To find: The most likely sequence of states C1 C2 ... CT which produces the given output sequence, i.e.

C1 C2 ... CT = argmax_C [ P(C | a1 a2 ... aT) ]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contd...

Data Structures:
1. An N*T array called SEQSCORE to maintain the winner sequence always (N = #states, T = length of o/p sequence)
2. Another N*T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For i = 2 to N do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For t = 2 to T do
    For i = 1 to N do
        SEQSCORE(i,t) = Max over j = 1..N of [ SEQSCORE(j, (t-1)) * P(Sj --a_k--> Si) ]
        BACKPTR(i,t) = the index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = the i that maximizes SEQSCORE(i,T)
For i from (T-1) to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1*T
2. SEQSCORE can be T*2

Homework: Compare this with A*, Beam Search. Reason for this comparison: both of them work for finding and recovering a sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 109
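A minimal Python sketch of the three steps above (Initialization, Iteration, Sequence Identification). It is not the lecture's reference code: the function and variable names are illustrative, states[0] is taken to be the start state S1, and the arc probabilities P(Si --a_k--> Sj) are supplied as a dictionary.

    def viterbi(states, arc_prob, outputs):
        """states: list of state names (states[0] = start state S1);
        arc_prob[(Si, a, Sj)] = P(Si --a--> Sj); outputs: observed string a1..aT."""
        N, T = len(states), len(outputs)
        SEQSCORE = [[0.0] * (T + 1) for _ in range(N)]
        BACKPTR = [[0] * (T + 1) for _ in range(N)]
        SEQSCORE[0][0] = 1.0                      # Initialization: start in S1

        for t in range(1, T + 1):                 # Iteration
            a = outputs[t - 1]
            for i, si in enumerate(states):
                best_j, best_score = 0, 0.0
                for j, sj in enumerate(states):
                    score = SEQSCORE[j][t - 1] * arc_prob.get((sj, a, si), 0.0)
                    if score > best_score:
                        best_j, best_score = j, score
                SEQSCORE[i][t], BACKPTR[i][t] = best_score, best_j

        # Sequence identification: best final state, then walk BACKPTR backwards
        c = max(range(N), key=lambda i: SEQSCORE[i][T])
        best = SEQSCORE[c][T]
        path = [c]
        for t in range(T, 0, -1):
            c = BACKPTR[c][t]
            path.append(c)
        return [states[i] for i in reversed(path)], best

    # Reproducing the probabilistic-FSM example above (output sequence a1-a2-a1-a2):
    arcs = {('S1', 'a1', 'S1'): 0.1, ('S1', 'a2', 'S1'): 0.2,
            ('S1', 'a1', 'S2'): 0.3, ('S1', 'a2', 'S2'): 0.4,
            ('S2', 'a1', 'S1'): 0.2, ('S2', 'a2', 'S1'): 0.3,
            ('S2', 'a1', 'S2'): 0.3, ('S2', 'a2', 'S2'): 0.2}
    print(viterbi(['S1', 'S2'], arcs, ['a1', 'a2', 'a1', 'a2']))
    # -> (['S1', 'S2', 'S1', 'S2', 'S1'], 0.0081), matching the tabular representation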

Viterbi Algorithm for the Urn problem (first two symbols)

[Figure: trellis rooted at S0. On ε, S0 branches to U1, U2, U3 with probabilities 0.5, 0.3, 0.2. On R, each Ui branches again to U1, U2, U3; each arc carries P(R|Ui) x P(Uj|Ui), giving the nine path scores 0.015, 0.06, 0.075 (out of U1), 0.018, 0.006, 0.006 (out of U2) and 0.036, 0.048, 0.036 (out of U3).]

21 July 2014 Pushpak Bhattacharyya Intro POS

110
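A quick cross-check of these first two steps, reusing the viterbi() sketch from the previous page (it must be in scope); the 0.5/0.3/0.2 start probabilities and the emission-at-source arc convention are read off the figure above, so treat the exact numbers as a reconstruction.

    # First two symbols (ε, R) of the urn trellis. Arcs out of S0 carry the start
    # probabilities; arcs out of each urn carry P(R | source urn) x P(dest | source).
    A = {'U1': {'U1': 0.1, 'U2': 0.4, 'U3': 0.5},
         'U2': {'U1': 0.6, 'U2': 0.2, 'U3': 0.2},
         'U3': {'U1': 0.3, 'U2': 0.4, 'U3': 0.3}}
    B_R = {'U1': 0.3, 'U2': 0.1, 'U3': 0.6}          # P(R | urn)
    start = {'U1': 0.5, 'U2': 0.3, 'U3': 0.2}        # P(Ui | S0), read off the figure

    arcs = {('S0', 'eps', u): p for u, p in start.items()}
    arcs.update({(u, 'R', v): B_R[u] * A[u][v] for u in A for v in A[u]})

    print(viterbi(['S0', 'U1', 'U2', 'U3'], arcs, ['eps', 'R']))
    # best partial path S0 -> U1 -> U3 with score 0.5 x 0.3 x 0.5 = 0.075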

Markov process of order > 1 (say 2)

Same theory works:

P(S) . P(O|S) = [P(O0|S0) P(S1|S0)]
                [P(O1|S1) P(S2|S1 S0)] [P(O2|S2) P(S3|S2 S1)] [P(O3|S3) P(S4|S3 S2)]
                [P(O4|S4) P(S5|S4 S3)] [P(O5|S5) P(S6|S5 S4)] [P(O6|S6) P(S7|S6 S5)]
                [P(O7|S7) P(S8|S7 S6)] [P(O8|S8) P(S9|S8 S7)]

We introduce the states S0 and S9 as initial and final states respectively.

After S8 the next state is S9 with probability 1, i.e. P(S9|S8 S7) = 1. O0 is the ε-transition.

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs      ε   R   R   G   G   B   R   G   R
State   S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence? Language modeling problem.

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   - Less probable because of grammatical mistake
3. P("The svn rises in the east")
   - Less probable because of lexical mistake
4. P("The sun rises in the west")
   - Less probable because of semantic mistake

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model:

1. Detect well-formedness
   - Lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   - Given a piece of text, what language does it belong to?
     Good morning - English
     Guten morgen - German
     Bon jour - French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0, o1, o2, o3, ..., om)?

P(O) = Σ_S P(O, S)     [Marginalization]

Consider the observation sequence:

O = o0 o1 o2 ... om
S = S0 S1 S2 ... Sm Sm+1

where the Si's represent the state sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k,i)

Define F(k,i) = probability of being in state Si having seen o0 o1 o2 ... ok.

F(k,i) = P(o0, o1, o2, ..., ok, Si)

With m as the length of the observed sequence and N states,

P(observed sequence) = P(o0, o1, o2, ..., om)
                     = Σ_{p=0..N} P(o0, o1, o2, ..., om, Sp)
                     = Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)

F(k, q) = P(o0, o1, ..., ok, Sq)
        = P(o0, o1, ..., ok-1, ok, Sq)
        = Σ_{p=0..N} P(o0, o1, ..., ok-1, Sp, ok, Sq)
        = Σ_{p=0..N} P(o0, o1, ..., ok-1, Sp) . P(ok, Sq | o0, o1, ..., ok-1, Sp)
        = Σ_{p=0..N} F(k-1, p) . P(ok, Sq | Sp)
        = Σ_{p=0..N} F(k-1, p) . P(Sp --ok--> Sq)

        O0  O1  O2  O3  ...  Ok  Ok+1  ...  Om-1  Om
        S0  S1  S2  S3  ...  Sp  Sq    ...  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119
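A minimal sketch of this recursion; arc_prob uses the same (source state, output, destination state) keys as the earlier Viterbi sketch, and the names forward / arc_prob are illustrative.

    # Forward recursion F(k,q) = Σ_p F(k-1,p) . P(Sp --ok--> Sq), with the machine
    # starting in states[0]. arc_prob[(Sp, ok, Sq)] is the arc probability.
    def forward(states, arc_prob, outputs):
        F = {s: (1.0 if s == states[0] else 0.0) for s in states}
        for o in outputs:
            F = {q: sum(F[p] * arc_prob.get((p, o, q), 0.0) for p in states)
                 for q in states}
        return F      # P(observed sequence) = sum of F over the admissible end states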

Backward probability B(k,i)

Define B(k,i) = probability of seeing ok ok+1 ok+2 ... om given that the state was Si.

B(k,i) = P(ok, ok+1, ok+2, ..., om | Si)

With m as the length of the whole observed sequence,

P(observed sequence) = P(o0, o1, o2, ..., om)
                     = P(o0, o1, o2, ..., om | S0)
                     = B(0,0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)

B(k, p) = P(ok, ok+1, ok+2, ..., om | Sp)
        = P(ok+1, ok+2, ..., om, ok | Sp)
        = Σ_{q=0..N} P(ok+1, ok+2, ..., om, ok, Sq | Sp)
        = Σ_{q=0..N} P(ok, Sq | Sp) . P(ok+1, ok+2, ..., om | ok, Sq, Sp)
        = Σ_{q=0..N} P(ok+1, ok+2, ..., om | Sq) . P(ok, Sq | Sp)
        = Σ_{q=0..N} B(k+1, q) . P(Sp --ok--> Sq)

        O0  O1  O2  O3  ...  Ok  Ok+1  ...  Om-1  Om
        S0  S1  S2  S3  ...  Sp  Sq    ...  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121
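The mirror-image sketch for the backward recursion, under the same assumptions and dictionary format as the forward sketch above.

    # Backward recursion B(k,p) = Σ_q P(Sp --ok--> Sq) . B(k+1,q); at the end of the
    # sequence B is 1 for every state, and P(O) = B(0, start state).
    def backward(states, arc_prob, outputs):
        B = {s: 1.0 for s in states}
        for o in reversed(outputs):
            B = {p: sum(arc_prob.get((p, o, q), 0.0) * B[q] for q in states)
                 for p in states}
        return B      # read off B[states[0]] for P(observed sequence)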

How Forward Probability Works

Goal of Forward Probability: to find P(O) [the probability of the Observation Sequence].

E.g. ^ People laugh .

[Figure: trellis ^ → {N, V} → {N, V} → . for this sentence.]

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition:
      ^     N     V     .
^     0    0.7   0.3    0
N     0    0.2   0.6   0.2
V     0    0.6   0.2   0.2
.     1     0     0     0

Lexical:
      ε    People   Laugh
^     1      0        0
N     0     0.8      0.2
V     0     0.1      0.9
.     1      0        0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all state sequences S of Π_i [ P(Oi|Si) . P(Si+1|Si) ]

Computation in various paths of the Tree

             ε     People    Laugh
Path 1:   ^     N       N       .
P(Path1) = (1.0 x 0.7) x (0.8 x 0.2) x (0.2 x 0.2)

Path 2:   ^     N       V       .
P(Path2) = (1.0 x 0.7) x (0.8 x 0.6) x (0.9 x 0.2)

Path 3:   ^     V       N       .
P(Path3) = (1.0 x 0.3) x (0.1 x 0.6) x (0.2 x 0.2)

Path 4:   ^     V       V       .
P(Path4) = (1.0 x 0.3) x (0.1 x 0.2) x (0.9 x 0.2)

[Figure: the tree ^ → {N, V} → {N, V} → . with ε, People, Laugh read on successive levels.]

21 July 2014Pushpak Bhattacharyya Intro

POS 124
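A small sketch of this inefficient computation: enumerate all four tag paths for "^ people laugh ." and sum their probabilities, using the transition and lexical tables above (the dictionary names trans and lex are illustrative).

    # Enumerate every tag path for "^ people laugh ." and sum P(path).
    from itertools import product

    trans = {('^', 'N'): 0.7, ('^', 'V'): 0.3,
             ('N', 'N'): 0.2, ('N', 'V'): 0.6, ('N', '.'): 0.2,
             ('V', 'N'): 0.6, ('V', 'V'): 0.2, ('V', '.'): 0.2}
    lex = {('^', 'eps'): 1.0,
           ('N', 'People'): 0.8, ('N', 'Laugh'): 0.2,
           ('V', 'People'): 0.1, ('V', 'Laugh'): 0.9}

    total = 0.0
    for t1, t2 in product(('N', 'V'), repeat=2):          # the four paths ^ t1 t2 .
        p = (lex[('^', 'eps')] * trans[('^', t1)]) \
            * (lex[(t1, 'People')] * trans[(t1, t2)]) \
            * (lex[(t2, 'Laugh')] * trans[(t2, '.')])
        print('^', t1, t2, '.', p)
        total += p
    print('P(O) =', total)    # 0.00448 + 0.06048 + 0.00072 + 0.00108 = 0.06676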

Computations on the Trellis

F = accumulated F x output probability x transition probability

F(N1) = 0.7 x 1.0
F(V1) = 0.3 x 1.0
F(N2) = F(N1) x (0.2 x 0.8) + F(V1) x (0.6 x 0.1)
F(V2) = F(N1) x (0.6 x 0.8) + F(V1) x (0.2 x 0.1)
F(.)  = F(N2) x (0.2 x 0.2) + F(V2) x (0.2 x 0.9)

[Figure: trellis ^ → {N1, V1} → {N2, V2} → . with ε, People, Laugh read on successive levels.]

21 July 2014Pushpak Bhattacharyya Intro

POS 125
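The same quantity computed on the trellis, reproducing the F values above; the code follows the slide's arc convention (output probability of the symbol emitted at the source state times the transition probability), and the variable names are illustrative.

    # Forward computation on the trellis for "^ people laugh .".
    trans = {('^', 'N'): 0.7, ('^', 'V'): 0.3,
             ('N', 'N'): 0.2, ('N', 'V'): 0.6, ('N', '.'): 0.2,
             ('V', 'N'): 0.6, ('V', 'V'): 0.2, ('V', '.'): 0.2}
    lex = {('^', 'eps'): 1.0,
           ('N', 'People'): 0.8, ('N', 'Laugh'): 0.2,
           ('V', 'People'): 0.1, ('V', 'Laugh'): 0.9}

    # First column: ^ emits eps and moves to N or V
    F1 = {s: lex[('^', 'eps')] * trans[('^', s)] for s in ('N', 'V')}     # N: 0.7, V: 0.3

    # Second column: the first-column states emit "People"
    F2 = {q: sum(F1[p] * lex[(p, 'People')] * trans[(p, q)] for p in ('N', 'V'))
          for q in ('N', 'V')}                                            # N: 0.13, V: 0.342

    # Accumulate into '.': the second-column states emit "Laugh"
    P_O = sum(F2[p] * lex[(p, 'Laugh')] * trans[(p, '.')] for p in ('N', 'V'))
    print(P_O)      # 0.06676, the same value as the path enumeration, with fewer multiplications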

Number of Multiplications

Tree: Each path has 5 multiplications (plus 1 addition to combine it with the others). There are 4 paths in the tree, therefore a total of 20 multiplications and 3 additions.

Trellis: F(N1) -> 1 multiplication; F(V1) -> 1 multiplication; F(N2) = F(N1) x (1 mult) + F(V1) x (1 mult) = 4 multiplications + 1 addition. Similarly for F(V2) and F(.): 4 multiplications and 1 addition each. So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = #states and |O| = observation length (excluding ^ and .).

Stage 1 of the Trellis: |S| multiplications.

Stage 2 of the Trellis: |S| nodes; each node needs computation over |S| arcs. Each arc = 1 multiplication; accumulating into F = 1 more multiplication. Total: 2|S|^2 multiplications.

Same for each stage before reading '.'. At the final stage ('.'): 2|S| multiplications.

Therefore total multiplications = |S| + 2|S|^2 (|O| - 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis
2. Take the sum of F values multiplied by P(Si → Sj)
3. Complexity = |S| + 2|S|^2 (|O| - 1) + 2|S|
              = 2|S|^2 |O| - 2|S|^2 + 3|S|
              = O(|S|^2 |O|)

i.e. linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward Probability
   a) Derive the Backward Algorithm
   b) Compute its complexity
2. Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List

TnT (http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (http://www.cse.iitb.ac.in/~pb/papers/ACL-2006-Hindi-POS-Tagging.pdf)

Projection (http://www.dipanjandas.com/files/posInduction.pdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Behaviour of "That"

That man is known by the company he keeps. (Demonstrative)

Man that is known by the company he keeps gets a good job. (Pronoun)

That man is known by the company he keeps is a proverb. (Complementation)

Chaotic systems: systems where a small perturbation in input causes a large change in output.

21 July 2014Pushpak Bhattacharyya Intro

POS 62

POS disambiguation

was_F very_R much_R evident_J on_F Wednesday_N when_F/N ('when' can be a relative pronoun (put under 'N') as in 'I know the time when he comes') the_F legendary_J batsman_N who_F/N has_V always_R let_V his_N bat_N/V talk_V/N struggle_V/N answer_V/N barrage_N/V question_N/V function_N/V promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (1/2)

Best tag sequence
= T* = argmax P(T|W)
     = argmax P(T)P(W|T)     (by Bayes' Theorem)

P(T) = P(t0=^, t1, t2, ..., tn+1=.)
     = P(t0)P(t1|t0)P(t2|t1,t0)P(t3|t2,t1,t0) ... P(tn|tn-1,tn-2,...,t0)P(tn+1|tn,tn-1,...,t0)
     = P(t0)P(t1|t0)P(t2|t1) ... P(tn|tn-1)P(tn+1|tn)
     = Π_{i=0}^{n+1} P(ti|ti-1)     (Bigram Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (2/2)

P(W|T) = P(w0|t0-tn+1)P(w1|w0,t0-tn+1)P(w2|w1,w0,t0-tn+1) ... P(wn|w0-wn-1,t0-tn+1)P(wn+1|w0-wn,t0-tn+1)

Assumption: A word is determined completely by its tag. This is inspired by speech recognition.

       = P(w0|t0)P(w1|t1) ... P(wn+1|tn+1)

       = Π_{i=0}^{n+1} P(wi|ti)

       = Π_{i=1}^{n+1} P(wi|ti)     (Lexical Probability Assumption)

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R ._.

[Figure: the tag sequence drawn as a chain of states with bigram probabilities on the arcs and lexical probabilities generating the words; alternative candidate tags (V, N, ...) are shown as competing states for each word.]

This model is called the Generative model. Here words are observed from tags as states. This is similar to an HMM.

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi - Unigram, Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

[Bar chart: Per POS Accuracy for the Bigram Assumption — one bar per tag of the tagset (AJ0, AJ0-NN1, AJ0-VVG, AJC, AT0, AV0-AJ0, AVP-PRP, AVQ-CJS, CJS, CJS-PRP, CJT-DT0, CRD-PNI, DT0, DTQ, ..., VVN, VVN-VVD, VVZ-NN2); y-axis from 0 to 1.2.]

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

            AJ0   AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN  AJC  AJS   AT0   AV0   AV0-AJ0  AVP
AJ0        2899     20       32        1        3        3      0    0     18    35      27      1
AJ0-AV0      31     18        2        0        0        0      0    0      0     1      15      0
AJ0-NN1     161      0      116        0        0        0      0    0      0     0       1      0
AJ0-VVD       7      0        0        0        0        0      0    0      0     0       0      0
AJ0-VVG       8      0        0        0        2        0      0    0      1     0       0      0
AJ0-VVN       8      0        0        3        0        2      0    0      1     0       0      0
AJC           2      0        0        0        0        0     69    0      0    11       0      0
AJS           6      0        0        0        0        0      0   38      0     2       0      0
AT0         192      0        0        0        0        0      0    0   7000    13       0      0
AV0         120      8        2        0        0        0     15    2     24  2444      29     11
AV0-AJ0      10      7        0        0        0        0      0    0      0    16      33      0
AVP          24      0        0        0        0        0      0    0      1    11       0    737

21 July 2014Pushpak Bhattacharyya Intro

POS 70
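A small sketch of how the per-POS accuracy of the previous slide is read off such a confusion matrix: accuracy of a tag = correctly tagged count / total gold occurrences of that tag. Only the twelve columns visible above are used, so the numbers are approximate.

    # Per-POS accuracy from confusion matrix rows (two rows copied from above).
    rows = {
        'AJ0': [2899, 20, 32, 1, 3, 3, 0, 0, 18, 35, 27, 1],
        'AT0': [192, 0, 0, 0, 0, 0, 0, 0, 7000, 13, 0, 0],
    }
    diagonal = {'AJ0': 0, 'AT0': 8}        # column index of the gold tag itself

    for tag, row in rows.items():
        acc = row[diagonal[tag]] / sum(row)
        print(tag, round(acc, 3))          # e.g. AJ0 ~ 0.954, AT0 ~ 0.972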

[Figure: the NLP Trinity — Problem axis (Morph Analysis, Part of Speech Tagging, Parsing, Semantics), Algorithm axis (HMM, CRF, MEMM), Language axis (Hindi, Marathi, English, French); HMM highlighted.]

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example: Colored Ball choosing

Urn 1: # of Red = 30, # of Green = 50, # of Blue = 20
Urn 2: # of Red = 10, # of Green = 40, # of Blue = 50
Urn 3: # of Red = 60, # of Green = 10, # of Blue = 30

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

Given transition probabilities:

        U1    U2    U3
U1     0.1   0.4   0.5
U2     0.6   0.2   0.2
U3     0.3   0.4   0.3

and emission probabilities:

        R     G     B
U1     0.3   0.5   0.2
U2     0.1   0.4   0.5
U3     0.6   0.1   0.3

Observation: RRGGBRGR

State Sequence: ??? (not so easily computable)

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Transition probability table and Emission probability table (above)

Diagrammatic representation (1/2)

[Figure: the three urns U1, U2, U3 drawn as states with the transition probabilities on the arcs (0.1, 0.4, 0.5 out of U1; 0.6, 0.2, 0.2 out of U2; 0.3, 0.4, 0.3 out of U3) and the emission probabilities written inside each state (U1: R 0.3, G 0.5, B 0.2; U2: R 0.1, G 0.4, B 0.5; U3: R 0.6, G 0.1, B 0.3).]

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (2/2)

[Figure: the same machine with each arc now labelled by the joint probabilities P(colour | source urn) x P(destination urn | source urn), e.g. the U1→U1 arc carries R 0.03, G 0.05, B 0.02 and the U2→U1 arc carries R 0.06, G 0.24, B 0.30.]

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM

1. Given the observation sequence, find the possible state sequences - Viterbi
2. Given the observation sequence, find its probability - forward/backward algorithm
3. Given the observation sequence, find the HMM parameters - Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi

The "start" and "end" are important in a sequence. Subtrees get eliminated due to the Markov Assumption.

POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), . (end) [start & end states]

Lexicon:
people: N, V
laugh: N, V

Corpora for Training:
^ w11_t11 w12_t12 w13_t13 ... w1k_1_t1k_1 .
^ w21_t21 w22_t22 w23_t23 ... w2k_2_t2k_2 .
^ wn1_tn1 wn2_tn2 wn3_tn3 ... wnk_n_tnk_n .

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

POS disambiguation was_F very_R much_R evident_J on_F Wednesday_N when_FN (lsquowhenrsquo can be a relative pronoun (put under lsquoN) as in lsquoI

know the time when he comesrsquo) the_F legendary_J batsman_N who_FN has_V always_R let_V his_N bat_NV talk_VN struggle_V N answer_VN barrage_NV question_NV function_NV promote_V cricket_N league_N city_N

21 July 2014Pushpak Bhattacharyya Intro

POS 63

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Mathematics of POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 64

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

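The two tables specify the HMM apart from the initial state distribution π, which the slides introduce later without numeric values. A sketch of the model in Python; the uniform π below is an assumption for illustration only:

# Urn HMM from the tables above. pi is assumed uniform here for illustration;
# the lecture defines pi_i = P(q1 = Ui) but gives no numbers for it.
states = ['U1', 'U2', 'U3']
symbols = ['R', 'G', 'B']

A = {'U1': {'U1': 0.1, 'U2': 0.4, 'U3': 0.5},   # transition probabilities
     'U2': {'U1': 0.6, 'U2': 0.2, 'U3': 0.2},
     'U3': {'U1': 0.3, 'U2': 0.4, 'U3': 0.3}}

B = {'U1': {'R': 0.3, 'G': 0.5, 'B': 0.2},      # emission probabilities
     'U2': {'R': 0.1, 'G': 0.4, 'B': 0.5},
     'U3': {'R': 0.6, 'G': 0.1, 'B': 0.3}}

pi = {'U1': 1/3, 'U2': 1/3, 'U3': 1/3}          # assumed, not from the slides

def joint(obs, seq):
    """Probability that the state sequence seq generates the observation string obs."""
    p = pi[seq[0]] * B[seq[0]][obs[0]]
    for (s_prev, s), o in zip(zip(seq, seq[1:]), obs[1:]):
        p *= A[s_prev][s] * B[s][o]
    return p

print(joint('RRG', ['U1', 'U3', 'U2']))  # 1/3 * 0.3 * 0.5*0.6 * 0.4*0.4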

Diagrammatic representation (1/2)

[Figure: state diagram of the urn HMM – states U1, U2, U3 with the transition probabilities (0.1, 0.4, 0.5; 0.6, 0.2, 0.2; 0.3, 0.4, 0.3) on the arcs and the emission probabilities attached to each state (U1: R 0.3, G 0.5, B 0.2; U2: R 0.1, G 0.4, B 0.5; U3: R 0.6, G 0.1, B 0.3)]

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (2/2)

[Figure: the same state diagram with emission and transition probabilities multiplied out on each arc, e.g. the U1→U1 arc carries R 0.03, G 0.05, B 0.02 (= 0.1 x the U1 emission probabilities) and the U1→U3 arc carries R 0.15, G 0.25, B 0.10]

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM

1. Given the observation sequence, find the possible state sequence(s) – Viterbi algorithm
2. Given the observation sequence, find its probability – forward/backward algorithm
3. Given the observation sequence, find the HMM parameters – Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi

 The "start" and "end" markers are important in a sequence.
 Subtrees get eliminated due to the Markov Assumption.

POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), . (end) [start & end states]

Illustration of Viterbi – Lexicon

 people: N, V
 laugh: N, V

Corpora for Training:
 ^ w11_t11 w12_t12 w13_t13 ............ w1k_1_t1k_1 .
 ^ w21_t21 w22_t22 w23_t23 ............ w2k_2_t2k_2 .
 ^ wn1_tn1 wn2_tn2 wn3_tn3 ............ wnk_n_tnk_n .

Inference

[Figure: partial sequence graph starting ^ → N → N and ^ → N → V, ...]

Transition Probability Table:

        ^     N     V     O     .
^       0    0.6   0.2   0.2    0
N       0    0.1   0.4   0.3   0.2
V       0    0.3   0.1   0.3   0.3
O       0    0.3   0.2   0.3   0.2
.       1     0     0     0     0

This transition table will change from language to language due to language divergences.

Lexical Probability Table (size of this table = #POS tags in the tagset x vocabulary size, where vocabulary size = #unique words in the corpus):

        ε        people    laugh     ...
^       1          0         0       ...
N       0        1x10-3    1x10-5    ...
V       0        1x10-6    1x10-3    ...
O       0          0       1x10-9    ...
.       1          0         0       ...

Inference: New Sentence

^ people laugh .

p( ^ N N . | ^ people laugh . ) = (0.6 x 0.1) x (0.1 x 1x10-3) x (0.2 x 1x10-5)

[Figure: partial sequence graph ^ → {N, V} (people) → {N, V} (laugh) → ., with ε emitted at ^ and .]

Computational Complexity

 If we have to get the probability of each sequence and then find the maximum among them, we would run into an exponential number of computations.
 If |s| = #states (tags + ^ + .) and |o| = length of the sentence (#words + ^ + .), then #sequences = |s|^(|o|-2)
 But a large number of partial computations can be reused using Dynamic Programming.
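A brute-force version of the search makes the blow-up concrete: it enumerates every tag sequence over {N, V, O} and scores each one. A sketch, assuming the transition and lexical tables above encoded as nested dicts (helper names are mine):

from itertools import product

# Tables above, encoded as nested dicts (only the entries needed here).
trans = {'^': {'N': 0.6, 'V': 0.2, 'O': 0.2, '.': 0.0},
         'N': {'N': 0.1, 'V': 0.4, 'O': 0.3, '.': 0.2},
         'V': {'N': 0.3, 'V': 0.1, 'O': 0.3, '.': 0.3},
         'O': {'N': 0.3, 'V': 0.2, 'O': 0.3, '.': 0.2}}
lex = {'N': {'people': 1e-3, 'laugh': 1e-5},
       'V': {'people': 1e-6, 'laugh': 1e-3},
       'O': {'laugh': 1e-9}}

def brute_force_best(words):
    """Score every tag sequence over {N, V, O} -- 3^len(words) of them."""
    best_p, best_tags = 0.0, None
    for tags in product('NVO', repeat=len(words)):
        p, prev = 1.0, '^'
        for w, t in zip(words, tags):
            p *= trans[prev][t] * lex[t].get(w, 0.0)   # transition x lexical probability
            prev = t
        p *= trans[prev]['.']                          # transition into the end marker
        if p > best_p:
            best_p, best_tags = p, tags
    return best_tags, best_p

print(brute_force_best(['people', 'laugh']))  # (('N', 'V'), 7.2e-08) after 3^2 = 9 evaluations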

Dynamic Programming

[Figure: trellis ^ → {N1, V2, O3} on "people" → {N4, V5, O6} on "laugh" → ., with an ε-transition at the start]

 0.6 x 1.0 = 0.6
 0.6 x 0.1 x 10-3 = 6 x 10-5
 0.6 x 0.4 x 10-3 = 2.4 x 10-4
 0.6 x 0.3 x 10-3 = 1.8 x 10-4
 0.6 x 0.2 x 10-3 = 1.2 x 10-4

No need to expand N4 and N5 because they will never be a part of the winning sequence.

Computational Complexity

 Retain only those N, V, O nodes which end in the highest sequence probability.
 Now complexity reduces from |s|^|o| to |s|·|o|.
 Here we followed the Markov assumption of order 1.

Points to ponder w.r.t. HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

 Start with the start state.
 Keep advancing sequences that are "maximum" amongst all those ending in the same state.

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm

[Figure: tree for the sentence "^ People laugh .". From ^ (on ε) the children N, V, O carry (0.6), (0.2), (0.2); after "People" the retained children carry (0.06x10^-3), (0.24x10^-3), (0.18x10^-3); after "laugh" they carry (0.06x10^-6), (0.02x10^-6), (0.06x10^-6); the remaining branches carry (0)]

Claim: We do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass

 Will a word always be given the same tag? No. Consider the examples:
  ^ people the city with soldiers . (i.e. 'populate')
  ^ quickly people the city .
 In the first sentence "people" is most likely to be tagged as noun, whereas in the second the probability mass will shift and "people" will be tagged as verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon

 Long tail phenomenon: the probability is very low but not zero over a large observed sequence.
 Language phenomenon: "people", which is predominantly tagged as "Noun", displays a long tail phenomenon; "laugh" is predominantly tagged as "Verb".

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

[Figure: two nodes for "LAUGH" – N1 with probability 6x10^-5 and N2 with probability 6x10^-8 – each expanding to children N, V, O]

Next step: all the probabilities will be multiplied by identical probabilities (lexical and transition). So the children of N2 will have probability less than the children of N1.

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean?

P(A|B) = P(B|A), if P(A) = P(B)

P(A|B) may mean:
 Causality: B causes A
 Sequentiality: A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here:
 S = {U1, U2, U3}, V = {R, G, B}
 For observation O = o1 ... on and state sequence Q = q1 ... qn
 π_i = P(q1 = Ui)

A = transition probability table:

        U1    U2    U3
U1     0.1   0.4   0.5
U2     0.6   0.2   0.2
U3     0.3   0.4   0.3

B = emission probability table:

        R     G     B
U1     0.3   0.5   0.2
U2     0.1   0.4   0.5
U3     0.6   0.1   0.3

92

Observations and states

        O1  O2  O3  O4  O5  O6  O7  O8
OBS:    R   R   G   G   B   R   G   R
State:  S1  S2  S3  S4  S5  S6  S7  S8

 Si = U1, U2 or U3 (a particular state)
 S: state sequence
 O: observation sequence
 S* = "best" possible state (urn) sequence
 Goal: maximize P(S|O) by choosing the "best" S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O), where S is the State Sequence and O is the Observation Sequence.

S* = argmax_S P(S|O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

P(S|O) = P(S1-8 | O1-8)
       = P(S1|O) P(S2|S1, O) P(S3|S2, S1, O) ... P(S8|S7 ... S1, O)

By Markov Assumption (a state depends only on the previous state):

P(S|O) = P(S1|O) P(S2|S1, O) P(S3|S2, O) ... P(S8|S7, O)

        O1  O2  O3  O4  O5  O6  O7  O8
OBS:    R   R   G   G   B   R   G   R
State:  S1  S2  S3  S4  S5  S6  S7  S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) = P(A) P(B|A) / P(B)

P(A): Prior
P(B|A): Likelihood

argmax_S P(S|O) = argmax_S P(S) P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

P(S) = P(S1-8)
     = P(S1) P(S2|S1) P(S3|S2, S1) P(S4|S3, S2, S1) ... P(S8|S7 ... S1)

By Markov Assumption (k=1):

P(S) = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) ... P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

P(O|S) = P(O1|S1-8) P(O2|O1, S1-8) P(O3|O2, O1, S1-8) ... P(O8|O1-7, S1-8)

Assumption: the ball drawn depends only on the Urn chosen.

P(O|S) = P(O1|S1) P(O2|S2) P(O3|S3) ... P(O8|S8)

P(S|O) ∝ P(S) P(O|S)
       = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) ... P(S8|S7)
         x P(O1|S1) P(O2|S2) P(O3|S3) ... P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S) P(O|S) = [P(O0|S0) P(S1|S0)] [P(O1|S1) P(S2|S1)] [P(O2|S2) P(S3|S2)] [P(O3|S3) P(S4|S3)] [P(O4|S4) P(S5|S4)] [P(O5|S5) P(S6|S5)] [P(O6|S6) P(S7|S6)] [P(O7|S7) P(S8|S7)] [P(O8|S8) P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e. P(S9|S8) = 1.
O0 is the ε-transition.

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:    ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

[Figure: chain S0 → S1 → S2 → ... → S8 → S9 with the observations ε, R, R, G, G, B, R, G, R on the arcs]

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:    ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

P(Ok|Sk) P(Sk+1|Sk) = P(Sk --Ok--> Sk+1)

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

[Figure: probabilistic FSM with two states S1 and S2; arcs labelled with (output, probability) pairs – S1→S2: (a1, 0.3), (a2, 0.4); S2→S1: (a1, 0.2), (a2, 0.3); S1→S1: (a1, 0.1), (a2, 0.2); S2→S2: (a1, 0.3), (a2, 0.2)]

The question here is: "what is the most likely state sequence given the output sequence seen?"

POS 101

Developing the tree

Start (ε): S1 = 1.0, S2 = 0.0

Reading a1:
 from S1 (1.0): to S1: 1.0 x 0.1 = 0.1;  to S2: 1.0 x 0.3 = 0.3
 from S2 (0.0): to S1: 0.0;              to S2: 0.0

Reading a2:
 from S1 (0.1): to S1: 0.1 x 0.2 = 0.02; to S2: 0.1 x 0.4 = 0.04
 from S2 (0.3): to S1: 0.3 x 0.3 = 0.09; to S2: 0.3 x 0.2 = 0.06

Choose the winning sequence per state, per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd...

Reading a1:
 from S1 (0.09): to S1: 0.09 x 0.1 = 0.009; to S2: 0.09 x 0.3 = 0.027
 from S2 (0.06): to S1: 0.06 x 0.2 = 0.012; to S2: 0.06 x 0.3 = 0.018

Reading a2:
 from S1 (0.012): to S1: 0.012 x 0.2 = 0.0024; to S2: 0.012 x 0.4 = 0.0048
 from S2 (0.027): to S1: 0.027 x 0.3 = 0.0081; to S2: 0.027 x 0.2 = 0.0054

The problem being addressed by this tree is S* = argmax_S P(S | a1-a2-a1-a2, μ)

where a1-a2-a1-a2 is the output sequence and μ the model or the machine.

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 --a1--> S2 --a2--> S1 --a1--> S2 --a2--> S1

Problem statement: find the best possible sequence
 S* = argmax_S P(S | O, μ)
where S = state sequence, O = output sequence, μ = machine or model.

Machine or Model = (S, A, T, S0), where S = state collection, A = alphabet set, T = transitions, S0 = start symbol (start state).

T is defined as the set of probabilities P(Si --ak--> Sj).

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

Each row is an ending state; each column is the latest symbol observed. Every cell records, for the two possible predecessors (S1, S2), the sequence probability ending in that state, and the winning (higher) value in each cell is carried forward.

        ε      a1                               a2             a1               a2
S1     1.0    (1.0x0.1, 0.0x0.2) = (0.1, 0.0)   (0.02, 0.09)   (0.009, 0.012)   (0.0024, 0.0081)
S2     0.0    (1.0x0.3, 0.0x0.3) = (0.3, 0.0)   (0.04, 0.06)   (0.027, 0.018)   (0.0048, 0.0054)

Final winner: 0.0081, ending in S1; its winning candidate came from state S2 (indicated by its being the 2nd element of the tuple). Going backward from the final winner, we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105
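The same table can be reproduced mechanically. A short sketch, assuming the arc probabilities read off the tree development above (state and symbol names as in the slides):

# Viterbi over the two-state probabilistic FSM, output sequence a1 a2 a1 a2.
p = {('S1', 'a1', 'S1'): 0.1, ('S1', 'a1', 'S2'): 0.3,
     ('S1', 'a2', 'S1'): 0.2, ('S1', 'a2', 'S2'): 0.4,
     ('S2', 'a1', 'S1'): 0.2, ('S2', 'a1', 'S2'): 0.3,
     ('S2', 'a2', 'S1'): 0.3, ('S2', 'a2', 'S2'): 0.2}

score = {'S1': 1.0, 'S2': 0.0}          # start in S1
back = []
for sym in ['a1', 'a2', 'a1', 'a2']:
    new, ptr = {}, {}
    for s in ('S1', 'S2'):              # best way of ending in state s after emitting sym
        cands = {prev: score[prev] * p[(prev, sym, s)] for prev in ('S1', 'S2')}
        ptr[s] = max(cands, key=cands.get)
        new[s] = cands[ptr[s]]
    back.append(ptr)
    score = new

best = max(score, key=score.get)        # 'S1', with probability 0.0081
path = [best]
for ptr in reversed(back):
    path.append(ptr[path[-1]])
print(score[best], list(reversed(path)))  # ~0.0081 ['S1', 'S2', 'S1', 'S2', 'S1']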

Algorithm
(following James Allen, "Natural Language Understanding" (2nd edition), Benjamin/Cummings (pub.), 1995)

Given:
1. The HMM, which means:
 a. Start State: S1
 b. Alphabet: A = {a1, a2, ..., ap}
 c. Set of States: S = {S1, S2, ..., Sn}
 d. Transition probability P(Si --ak--> Sj), which is equal to P(Sj, ak | Si)
2. The output string a1 a2 ... aT

To find: the most likely sequence of states C1 C2 ... CT which produces the given output sequence, i.e.
 C1 C2 ... CT = argmax_C [ P(C | a1 a2 ... aT) ]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structures:
1. An N x T array called SEQSCORE to maintain the winner sequence always (N = #states, T = length of o/p sequence)
2. Another N x T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For i = 2 to N do
 SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For t = 2 to T do
 For i = 1 to N do
  SEQSCORE(i,t) = Max(j=1..N) [ SEQSCORE(j, (t-1)) x P(Sj --a_k--> Si) ]
  BACKPTR(i,t) = the index j that gives the Max above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = i that maximizes SEQSCORE(i,T)
For i from (T-1) to 1 do
 C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1 x T
2. SEQSCORE can be T x 2

Homework: compare this with A*, Beam Search. Reason for this comparison: both of them work for finding and recovering a sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 109
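The three steps translate directly into runnable Python with the SEQSCORE/BACKPTR layout described above. The two-state machine at the bottom uses the same arc probabilities as the FSM example; the function signature and dict layout are my own choices, not prescribed by the lecture:

def viterbi(states, arcs, obs, start=0):
    """states: list of state names; arcs[i][k][j] = P(Si --symbol k--> Sj).
    Returns the most likely state sequence C(1..T) for the output string obs."""
    N, T = len(states), len(obs)
    SEQSCORE = [[0.0] * (T + 1) for _ in range(N)]
    BACKPTR = [[0] * (T + 1) for _ in range(N)]
    SEQSCORE[start][0] = 1.0                      # initialization: start state gets score 1
    for t in range(1, T + 1):                     # iteration
        for i in range(N):
            best_j, best = 0, 0.0
            for j in range(N):
                cand = SEQSCORE[j][t - 1] * arcs[j].get(obs[t - 1], {}).get(i, 0.0)
                if cand > best:
                    best_j, best = j, cand
            SEQSCORE[i][t], BACKPTR[i][t] = best, best_j
    # sequence identification (backward pass over the backpointers)
    C = [max(range(N), key=lambda i: SEQSCORE[i][T])]
    for t in range(T, 1, -1):
        C.append(BACKPTR[C[-1]][t])
    return [states[i] for i in reversed(C)]

# Two-state machine with the same arc probabilities as the FSM example above:
arcs = {0: {'a1': {0: 0.1, 1: 0.3}, 'a2': {0: 0.2, 1: 0.4}},
        1: {'a1': {0: 0.2, 1: 0.3}, 'a2': {0: 0.3, 1: 0.2}}}
print(viterbi(['S1', 'S2'], arcs, ['a1', 'a2', 'a1', 'a2']))  # ['S2', 'S1', 'S2', 'S1']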

Viterbi Algorithm for the Urn problem (first two symbols)

[Figure: Viterbi tree over the first two symbols ε and R – S0 branches to U1, U2, U3 (0.5, 0.3, 0.2); each of these is expanded again on reading R, and only the best-scoring predecessor of each node is retained (intermediate scores 0.03, 0.08, 0.15, 0.06, 0.02, 0.02, 0.18, 0.24, 0.18; leaf scores 0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036)]

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of order > 1 (say 2)

The same theory works:
P(S) P(O|S) = P(O0|S0) P(S1|S0) [P(O1|S1) P(S2|S1,S0)] [P(O2|S2) P(S3|S2,S1)] [P(O3|S3) P(S4|S3,S2)] [P(O4|S4) P(S5|S4,S3)] [P(O5|S5) P(S6|S5,S4)] [P(O6|S6) P(S7|S6,S5)] [P(O7|S7) P(S8|S7,S6)] [P(O8|S8) P(S9|S8,S7)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e. P(S9|S8,S7) = 1.
O0 is the ε-transition.

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:    ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why the probability of an observation sequence? The language modeling problem:

1. P("The sun rises in the east")
2. P("The sun rise in the east")
 • Less probable because of grammatical mistake
3. P("The svn rises in the east")
 • Less probable because of lexical mistake
4. P("The sun rises in the west")
 • Less probable because of semantic mistake

Probabilities are computed in the context of corpora.

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of a language model:

1. Detect well-formedness
 • Lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
 • Given a piece of text, what language does it belong to?
   Good morning – English
   Guten Morgen – German
   Bonjour – French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0, o1, o2, o3, ..., om)?

P(O) = Σ_S P(O, S)   (Marginalization)

Consider the observation sequence O = o0 o1 o2 ... om with state sequences S0 S1 S2 S3 ... Sm Sm+1, where the Si's represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0, o1, o2, o3, ..., om)

P(O, S) = P(S) P(O|S)
        = P(S0, S1, ..., Sm+1) P(O0, O1, ..., Om | S0, S1, ..., Sm+1)
        = [P(S0) P(S1|S0) P(S2|S1) ... P(Sm+1|Sm)] [P(O0|S0) P(O1|S1) ... P(Om|Sm)]
        = P(S0) [P(O0|S0) P(S1|S0)] [P(O1|S1) P(S2|S1)] ... [P(Om|Sm) P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k, i)

 Define F(k, i) = probability of being in state Si having seen o0 o1 o2 ... ok
 F(k, i) = P(o0, o1, o2, ..., ok, Si)
 With m as the length of the observed sequence and N states,
 P(observed sequence) = P(o0, o1, o2, ..., om)
                      = Σ_{p=0..N} P(o0, o1, o2, ..., om, Sp)
                      = Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)

F(k, q) = P(o0, o1, o2, ..., ok, Sq)
        = P(o0, o1, o2, ..., ok-1, ok, Sq)
        = Σ_{p=0..N} P(o0, o1, o2, ..., ok-1, Sp, ok, Sq)
        = Σ_{p=0..N} P(o0, o1, o2, ..., ok-1, Sp) P(ok, Sq | o0, o1, o2, ..., ok-1, Sp)
        = Σ_{p=0..N} F(k-1, p) P(ok, Sq | Sp)
        = Σ_{p=0..N} F(k-1, p) P(Sp --ok--> Sq)

        O0  O1  O2  O3  ...  Ok  Ok+1  ...  Om-1  Om
State:  S0  S1  S2  S3  ...  Sp  Sq    ...  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119
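The recurrence translates into a single left-to-right sweep over the trellis. A sketch, assuming arc probabilities P(Sp --o--> Sq) stored in a nested dict; the function and dict names are mine:

def forward(states, arc, obs, start):
    """F[q] accumulates P(o0..ok, Sq); arc[p][o][q] = P(Sp --o--> Sq). Returns P(obs)."""
    F = {q: (1.0 if q == start else 0.0) for q in states}    # before any observation
    for o in obs:
        F = {q: sum(F[p] * arc[p].get(o, {}).get(q, 0.0) for p in states)
             for q in states}                                # F(k,q) = sum_p F(k-1,p) P(Sp --o--> Sq)
    return sum(F.values())                                   # P(observed sequence) = sum_p F(m,p)

# Tiny illustrative machine: one state 'X' always emitting 'a' back to itself with prob 0.5.
arc = {'X': {'a': {'X': 0.5}}}
print(forward(['X'], arc, ['a', 'a', 'a'], 'X'))             # 0.125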

Backward probability B(k, i)

 Define B(k, i) = probability of seeing ok ok+1 ok+2 ... om given that the state was Si
 B(k, i) = P(ok, ok+1, ok+2, ..., om | Si)
 With m as the length of the whole observed sequence,
 P(observed sequence) = P(o0, o1, o2, ..., om)
                      = P(o0, o1, o2, ..., om | S0)
                      = B(0, 0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)

B(k, p) = P(ok, ok+1, ok+2, ..., om | Sp)
        = P(ok+1, ok+2, ..., om, ok | Sp)
        = Σ_{q=0..N} P(ok+1, ok+2, ..., om, ok, Sq | Sp)
        = Σ_{q=0..N} P(ok, Sq | Sp) P(ok+1, ok+2, ..., om | ok, Sq, Sp)
        = Σ_{q=0..N} P(ok+1, ok+2, ..., om | Sq) P(ok, Sq | Sp)
        = Σ_{q=0..N} B(k+1, q) P(Sp --ok--> Sq)

        O0  O1  O2  O3  ...  Ok  Ok+1  ...  Om-1  Om
State:  S0  S1  S2  S3  ...  Sp  Sq    ...  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121
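The backward recurrence runs the same sweep right to left; B(0, S0) then gives P(O), which is one way of approaching the exercise at the end of this part. A sketch with the same assumed arc-dict layout as the forward sketch above:

def backward(states, arc, obs, start):
    """B[p] accumulates P(ok..om | Sp); arc[p][o][q] = P(Sp --o--> Sq). Returns P(obs) = B(0, start)."""
    B = {p: 1.0 for p in states}                              # after the last observation
    for o in reversed(obs):
        B = {p: sum(arc[p].get(o, {}).get(q, 0.0) * B[q] for q in states)
             for p in states}                                 # B(k,p) = sum_q P(Sp --o--> Sq) B(k+1,q)
    return B[start]

arc = {'X': {'a': {'X': 0.5}}}                                # same toy machine as before
print(backward(['X'], arc, ['a', 'a', 'a'], 'X'))             # 0.125, equal to the forward result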

How Forward Probability Works

 Goal of forward probability: to find P(O) [the probability of the Observation Sequence].
 E.g. ^ people laugh .

[Figure: trellis ^ → {N, V} (people) → {N, V} (laugh) → .]

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probabilities:

        ^     N     V     .
^       0    0.7   0.3    0
N       0    0.2   0.6   0.2
V       0    0.6   0.2   0.2
.       1     0     0     0

Lexical probabilities:

        ε    people  laugh
^       1     0       0
N       0    0.8     0.2
V       0    0.1     0.9
.       1     0       0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ_{state sequences S} ∏_j P(Oj|Sj) P(Sj+1|Sj), i.e. every path through the trellis is scored separately and the results are summed.

Computation on the various paths of the tree (ε, People, Laugh):

Path 1: ^ N N .    P(Path1) = (1.0 x 0.7) x (0.8 x 0.2) x (0.2 x 0.2)
Path 2: ^ N V .    P(Path2) = (1.0 x 0.7) x (0.8 x 0.6) x (0.9 x 0.2)
Path 3: ^ V N .    P(Path3) = (1.0 x 0.3) x (0.1 x 0.6) x (0.2 x 0.2)
Path 4: ^ V V .    P(Path4) = (1.0 x 0.3) x (0.1 x 0.2) x (0.9 x 0.2)

[Figure: tree from ^ branching to N and V on "People", and again to N and V on "Laugh", with ε at the start and end]

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis
F = accumulated F x output probability x transition probability

F_N1 = 0.7 x 1.0
F_V1 = 0.3 x 1.0
F_N2 = F_N1 x (0.2 x 0.8) + F_V1 x (0.6 x 0.1)
F_V2 = F_N1 x (0.6 x 0.8) + F_V1 x (0.2 x 0.1)
F_.  = F_N2 x (0.2 x 0.2) + F_V2 x (0.2 x 0.9)

[Figure: trellis ^ → {N1, V1} (people) → {N2, V2} (laugh) → ., with ε at the start and end]

21 July 2014Pushpak Bhattacharyya Intro

POS 125
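Evaluating these expressions and comparing them with the four explicit paths confirms that the trellis and the tree give the same P(O). A short check in Python, using the transition/lexical numbers above:

# Forward values on the trellis for "^ people laugh ." (numbers from the tables above).
F_N1 = 0.7 * 1.0
F_V1 = 0.3 * 1.0
F_N2 = F_N1 * (0.2 * 0.8) + F_V1 * (0.6 * 0.1)      # 0.13
F_V2 = F_N1 * (0.6 * 0.8) + F_V1 * (0.2 * 0.1)      # 0.342
F_end = F_N2 * (0.2 * 0.2) + F_V2 * (0.2 * 0.9)     # 0.06676

# Same quantity by brute-force enumeration of the four paths:
paths = [(1.0*0.7) * (0.8*0.2) * (0.2*0.2),          # ^ N N .
         (1.0*0.7) * (0.8*0.6) * (0.9*0.2),          # ^ N V .
         (1.0*0.3) * (0.1*0.6) * (0.2*0.2),          # ^ V N .
         (1.0*0.3) * (0.1*0.2) * (0.9*0.2)]          # ^ V V .
print(F_end, sum(paths))                              # both 0.06676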

Number of Multiplications

Tree:
 Each path has 5 multiplications (+ 1 addition).
 There are 4 paths in the tree.
 Therefore a total of 20 multiplications and 3 additions.

Trellis:
 F_N1 -> 1 multiplication; F_V1 -> 1 multiplication
 F_N2 = F_N1 x (1 mult) + F_V1 x (1 mult) = 4 multiplications + 1 addition
 Similarly for F_V2 and F_.: 4 multiplications and 1 addition each.
 So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = #states and |O| = observation length (from '^' to '.').
 Stage 1 of the trellis: |S| multiplications.
 Stage 2 of the trellis: |S| nodes; each node needs computation over |S| arcs. Each arc = 1 multiplication; accumulated F = 1 more multiplication. Total: 2|S|^2 multiplications.
 The same holds for each stage before reading '.'. At the final stage ('.') -> 2|S| multiplications.
 Therefore total multiplications = |S| + 2|S|^2 (|O| - 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values multiplied by P(Sp --o--> Sq).
3. Complexity = |S| + 2|S|^2 (|O| - 1) + 2|S|
              = 2|S|^2 |O| - 2|S|^2 + 3|S|
              = O(|S|^2 |O|)
i.e. linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward Probability
 a) Derive the Backward Algorithm
 b) Compute its complexity

2. Express P(O) in terms of both the Forward and the Backward probability.

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

 Scrabble: auto-completion of words (human vs. machine)

 Humour detection using WordNet (incongruity theory)

 Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List
 TnT (http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf)

 Brill Tagger (http://delivery.acm.org/10.1145/1080000/1075553/p112-brill.pdf)

 Hindi POS Tagger built by IIT Bombay (http://www.cse.iitb.ac.in/~pb/papers/ACL-2006-Hindi-POS-Tagging.pdf)

 Projection (http://www.dipanjandas.com/files/posInduction.pdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Argmax computation (12)Best tag sequence= T= argmax P(T|W)= argmax P(T)P(W|T) (by Bayersquos Theorem)

P(T) = P(t0=^ t1t2 hellip tn+1=)= P(t0)P(t1|t0)P(t2|t1t0)P(t3|t2t1t0) hellip

P(tn|tn-1tn-2hellipt0)P(tn+1|tntn-1hellipt0)= P(t0)P(t1|t0)P(t2|t1) hellip P(tn|tn-1)P(tn+1|tn)

= P(ti|ti-1) Bigram AssumptionprodN+1

i = 0

21 July 2014Pushpak Bhattacharyya Intro

POS 65

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Argmax computation (22)

P(W|T) = P(w0|t0-tn+1)P(w1|w0t0-tn+1)P(w2|w1w0t0-tn+1) hellipP(wn|w0-wn-1t0-tn+1)P(wn+1|w0-wnt0-tn+1)

Assumption A word is determined completely by its tag This is inspired by speech recognition

= P(wo|to)P(w1|t1) hellip P(wn+1|tn+1)

= P(wi|ti)

= P(wi|ti) (Lexical Probability Assumption)

prodn+1

i = 0

prodn+1

i = 1

21 July 2014Pushpak Bhattacharyya Intro

POS 66

Generative Model

^_^ People_N Jump_V High_R _

^ N

V

V

N

A

N

Lexical Probabilities

BigramProbabilities

This model is called Generative model Here words are observed from tags as statesThis is similar to HMM

21 July 2014Pushpak Bhattacharyya Intro

POS 67

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List

- TnT (http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf)
- Brill Tagger (http://delivery.acm.org/10.1145/1080000/1075553/p112-brill.pdf)
- Hindi POS Tagger built by IIT Bombay (http://www.cse.iitb.ac.in/~pb/papers/ACL-2006-Hindi-POS-Tagging.pdf)
- Projection (http://www.dipanjandas.com/files/posInduction.pdf)



Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Typical POS tag steps

Implementation of Viterbi ndash Unigram

Bigram

Five Fold Evaluation

Per POS Accuracy

Confusion Matrix

21 July 2014Pushpak Bhattacharyya Intro

POS 68

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

0

02

04

06

08

1

12

AJ0

AJ0-

NN

1

AJ0-

VVG

AJC

AT0

AV0-

AJ0

AVP-

PRP

AVQ

-CJS CJS

CJS-

PRP

CJT-

DT0

CRD

-PN

I

DT0

DTQ IT

J

NN

1

NN

1-N

P0

NN

1-VV

G

NN

2-VV

Z

NP0

-NN

1

PNI

PNP

PNX

PRP

PRP-

CJS

TO0

VBB

VBG

VBN

VDB

VDG

VDN

VHB

VHG

VHN

VM0

VVB-

NN

1

VVD

-AJ0

VVG

VVG

-NN

1

VVN

VVN

-VVD

VVZ-

NN

2

Series1

Per POS Accuracy for Bigram Assumption

21 July 2014Pushpak Bhattacharyya Intro

POS 69

Screen shot of typical Confusion Matrix

AJ0 AJ0-AV0

AJ0-NN1

AJ0-VVD

AJ0-VVG

AJ0-VVN AJC AJS AT0 AV0

AV0-AJ0 AVP

AJ0 2899 20 32 1 3 3 0 0 18 35 27 1AJ0-AV0 31 18 2 0 0 0 0 0 0 1 15 0AJ0-NN1 161 0 116 0 0 0 0 0 0 0 1 0AJ0-VVD 7 0 0 0 0 0 0 0 0 0 0 0AJ0-VVG 8 0 0 0 2 0 0 0 1 0 0 0AJ0-VVN 8 0 0 3 0 2 0 0 1 0 0 0AJC 2 0 0 0 0 0 69 0 0 11 0 0AJS 6 0 0 0 0 0 0 38 0 2 0 0AT0 192 0 0 0 0 0 0 0 7000 13 0 0AV0 120 8 2 0 0 0 15 2 24 2444 29 11AV0-AJ0 10 7 0 0 0 0 0 0 0 16 33 0AVP 24 0 0 0 0 0 0 0 1 11 0 737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble: auto-completion of words (human vs machine)

Humour detection using WordNet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List

 TnT (http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf)

 Brill Tagger (http://delivery.acm.org/10.1145/1080000/1075553/p112-brill.pdf)

 Hindi POS Tagger built by IIT Bombay (http://www.cse.iitb.ac.in/~pb/papers/ACL-2006-Hindi-POS-Tagging.pdf)

 Projection (http://www.dipanjandas.com/files/posInduction.pdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Screen shot of typical Confusion Matrix

           AJ0  AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN  AJC  AJS   AT0   AV0  AV0-AJ0  AVP
AJ0       2899       20       32        1        3        3    0    0    18    35       27    1
AJ0-AV0     31       18        2        0        0        0    0    0     0     1       15    0
AJ0-NN1    161        0      116        0        0        0    0    0     0     0        1    0
AJ0-VVD      7        0        0        0        0        0    0    0     0     0        0    0
AJ0-VVG      8        0        0        0        2        0    0    0     1     0        0    0
AJ0-VVN      8        0        0        3        0        2    0    0     1     0        0    0
AJC          2        0        0        0        0        0   69    0     0    11        0    0
AJS          6        0        0        0        0        0    0   38     0     2        0    0
AT0        192        0        0        0        0        0    0    0  7000    13        0    0
AV0        120        8        2        0        0        0   15    2    24  2444       29   11
AV0-AJ0     10        7        0        0        0        0    0    0     0    16       33    0
AVP         24        0        0        0        0        0    0    0     1    11        0  737

21 July 2014Pushpak Bhattacharyya Intro

POS 70

[Figure: the NLP Trinity. Problem axis: Morph Analysis, Part of Speech Tagging, Parsing, Semantics; Algorithm axis: HMM, CRF, MEMM; Language axis: Hindi, Marathi, English, French. The point of interest here is HMM applied to POS tagging.]

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example: Colored Ball Choosing

Urn 1: % of Red = 30, % of Green = 50, % of Blue = 20
Urn 2: % of Red = 10, % of Green = 40, % of Blue = 50
Urn 3: % of Red = 60, % of Green = 10, % of Blue = 30

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd.)

Given the transition probability table:
        U1   U2   U3
  U1   0.1  0.4  0.5
  U2   0.6  0.2  0.2
  U3   0.3  0.4  0.3

and the emission probability table:
         R    G    B
  U1   0.3  0.5  0.2
  U2   0.1  0.4  0.5
  U3   0.6  0.1  0.3

Observation: R R G G B R G R
State sequence: ?
Not so easily computable!

21 July 2014Pushpak Bhattacharyya Intro

POS 73

(Transition probability table and emission probability table above.)
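For later reference, the two tables can be written down directly as arrays. This is only a convenience sketch of ours; the initial distribution π is not given on this slide, so the values below are taken from the start arcs in the "Viterbi Algorithm for the Urn problem" figure further on and should be treated as an assumption.

```python
import numpy as np

# Rows/columns ordered U1, U2, U3; colours ordered R, G, B.
A = np.array([[0.1, 0.4, 0.5],     # transition probabilities P(Uj | Ui)
              [0.6, 0.2, 0.2],
              [0.3, 0.4, 0.3]])
B = np.array([[0.3, 0.5, 0.2],     # emission probabilities  P(colour | Ui)
              [0.1, 0.4, 0.5],
              [0.6, 0.1, 0.3]])
pi = np.array([0.5, 0.3, 0.2])     # assumed initial distribution (see note above)

assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
```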

Diagrammatic representation (1/2)

[Figure: the three urns U1, U2, U3 drawn as states, with the transition probabilities above on the arcs (self-loops 0.1, 0.2, 0.3; cross arcs such as U1->U2 = 0.4, U2->U1 = 0.6, U3->U1 = 0.3, etc.) and each state's emission probabilities attached: U1: R 0.3, G 0.5, B 0.2; U2: R 0.1, G 0.4, B 0.5; U3: R 0.6, G 0.1, B 0.3.]

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (2/2)

[Figure: the same three-state diagram, now with each arc labelled by transition probability x emission probability per colour, e.g. U1->U1: R 0.03, G 0.05, B 0.02; U2->U1: R 0.06, G 0.24, B 0.30; U3->U2: R 0.24, G 0.04, B 0.12; U1->U3: R 0.15, G 0.25, B 0.10; U3->U1 and U3->U3: R 0.18, G 0.03, B 0.09; U2->U2 and U2->U3: R 0.02, G 0.08, B 0.10; and so on.]

POS 75

Classic problems with respect to HMMs:
1. Given the observation sequence, find the most probable state sequence - Viterbi algorithm
2. Given the observation sequence, find its probability - forward/backward algorithm
3. Given the observation sequence, find the HMM parameters - Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi

 The "start" and "end" markers are important in a sequence.
 Subtrees get eliminated due to the Markov assumption.

POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), . (end) [start & end states]

Lexicon:
  people: N, V
  laugh:  N, V

Corpora for training:
  ^ w11_t11 w12_t12 w13_t13 …… w1k_1_t1k_1 .
  ^ w21_t21 w22_t22 w23_t23 …… w2k_2_t2k_2 .
  ^ wn1_tn1 wn2_tn2 wn3_tn3 …… wnk_n_tnk_n .

Inference

[Partial sequence graph: ^ branching to N and V on "people", each branching again to N and V on "laugh".]

Transition probability table:
        ^     N     V     O     .
  ^     0    0.6   0.2   0.2    0
  N     0    0.1   0.4   0.3   0.2
  V     0    0.3   0.1   0.3   0.3
  O     0    0.3   0.2   0.3   0.2
  .     1     0     0     0     0

This transition table will change from language to language due to language divergences.

Lexical probability table:
        ε     people    laugh     …
  ^     1       0         0
  N     0     1x10^-3   1x10^-5
  V     0     1x10^-6   1x10^-3
  O     0       0       1x10^-9
  .     1       0         0

Size of this table = #POS tags in the tagset x vocabulary size
(vocabulary size = #unique words in the corpus)

Inference on a new sentence:

 ^ people laugh .

 p( ^ N N . | ^ people laugh . ) = (0.6 x 1.0) x (0.1 x 1x10^-3) x (0.2 x 1x10^-5)

[Figure: the partial sequence graph again, ^ branching to N and V, with ε on the start and end arcs.]
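As a quick sanity check, the product above is easy to evaluate (a one-liner of ours, not lecture code):

```python
p = (0.6 * 1.0) * (0.1 * 1e-3) * (0.2 * 1e-5)
print(p)    # 1.2e-10
```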

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = #states (tags + ^ + .) and |o| = length of the sentence (#words + ^ + .), then #sequences = |s|^(|o|-2).

But a large number of partial computations can be reused using Dynamic Programming
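A brute-force version is easy to write and makes the blow-up concrete. The sketch below is ours, using the transition and lexical tables from the preceding slides (any word not listed in the lexicon is given probability 0); it scores all 3^2 = 9 candidate tag sequences for "^ people laugh ." (only N, V, O can fill the interior positions) and keeps the best, which is exactly the work dynamic programming avoids repeating.

```python
from itertools import product

trans = {'^': {'N': 0.6, 'V': 0.2, 'O': 0.2},
         'N': {'N': 0.1, 'V': 0.4, 'O': 0.3, '.': 0.2},
         'V': {'N': 0.3, 'V': 0.1, 'O': 0.3, '.': 0.3},
         'O': {'N': 0.3, 'V': 0.2, 'O': 0.3, '.': 0.2}}
lex = {'^': {'ε': 1.0},
       'N': {'people': 1e-3, 'laugh': 1e-5},
       'V': {'people': 1e-6, 'laugh': 1e-3},
       'O': {'laugh': 1e-9}}

words = ['ε', 'people', 'laugh']          # emitted on the arcs ^->t1, t1->t2, t2->.
best = max(product('NVO', repeat=2),
           key=lambda t: (trans['^'][t[0]] * lex['^'].get(words[0], 0.0)
                          * trans[t[0]][t[1]] * lex[t[0]].get(words[1], 0.0)
                          * trans[t[1]]['.'] * lex[t[1]].get(words[2], 0.0)))
print(best)    # ('N', 'V'), i.e. ^ N V . : people as noun, laugh as verb
```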

Dynamic Programming

[Figure: DP tree for "^ people laugh .". From ^ (on ε) the level-1 nodes N1, V2, O3 carry scores 0.6 x 1.0 = 0.6, 0.2 and 0.2. The level-2 expansions on "people" carry scores such as 0.6 x 0.1 x 10^-3 = 6 x 10^-5, 0.6 x 0.4 x 10^-3 = 2.4 x 10^-4, 0.6 x 0.3 x 10^-3 = 1.8 x 10^-4 and 0.6 x 0.2 x 10^-3 = 1.2 x 10^-4; each level-2 node (N4, N5, …) again branches to N, V, O.]

No need to expand N4 and N5, because they will never be a part of the winning sequence.

Computational Complexity

 Retain only those N, V, O nodes which end in the highest sequence probability.

 Now the complexity reduces from |s|^|o| to |s| x |o|.

 Here we followed the Markov assumption of order 1.

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

 Start with the start state.
 Keep advancing only those sequences that are "maximum" amongst all those ending in the same state.

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm

Claim: we do not need to draw all the subtrees in the algorithm.

Tree for the sentence "^ People laugh ."

[Figure: tree rooted at ^ with branches N, V, O scored (0.6), (0.2), (0.2) on ε; the next level on "People" carries scores (0.06x10^-3), (0.24x10^-3), (0.18x10^-3); the level after that carries (0.06x10^-6), (0.02x10^-6), (0.06x10^-6); the remaining branches are (0), (0), (0).]

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass
 Will a word always be given the same tag?
 No. Consider the example:

   ^ people the city with soldiers . (i.e. 'populate')
   ^ quickly people the city .

 In the first sentence "people" is most likely to be tagged as a noun, whereas in the second the probability mass will shift and "people" will be tagged as a verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and language phenomenon

 Long tail phenomenon: the probability is very low, but not zero, over a large observed sequence.

 Language phenomenon:
   "people", which is predominantly tagged as "Noun", displays a long tail phenomenon.
   "laugh" is predominantly tagged as "Verb".

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

[Figure: two nodes N1 (score 6x10^-5) and N2 (score 6x10^-8), each expanding to children N, V, O on "LAUGH".]

In the next step all the probabilities will be multiplied by identical probabilities (lexical and transition), so the children of N2 will always have lower probability than the children of N1.

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

 P(A|B) = P(B|A) if P(A) = P(B)

 P(A|B) means:
   Causality: B causes A
   Sequentiality: A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here:
 S = {U1, U2, U3};  V = {R, G, B}
For the observation sequence O = o1 … on and state sequence Q = q1 … qn:

 π is the initial state distribution, πi = P(q1 = Ui)

 A = transition probabilities:
        U1   U2   U3
  U1   0.1  0.4  0.5
  U2   0.6  0.2  0.2
  U3   0.3  0.4  0.3

 B = emission probabilities:
         R    G    B
  U1   0.3  0.5  0.2
  U2   0.1  0.4  0.5
  U3   0.6  0.1  0.3

92

Observations and states

        O1 O2 O3 O4 O5 O6 O7 O8
OBS:     R  R  G  G  B  R  G  R
State:  S1 S2 S3 S4 S5 S6 S7 S8

 Si ∈ {U1, U2, U3}: a particular state
 S: state sequence
 O: observation sequence
 S* = "best" possible state (urn) sequence
 Goal: maximize P(S|O) by choosing the "best" S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

 S* = argmax_S P(S|O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

 P(S|O) = P(S1-8 | O1-8)
        = P(S1 | O) P(S2 | S1, O) P(S3 | S1-2, O) … P(S8 | S1-7, O)

By Markov assumption (a state depends only on the previous state):

 P(S|O) = P(S1 | O) P(S2 | S1, O) P(S3 | S2, O) … P(S8 | S7, O)

        O1 O2 O3 O4 O5 O6 O7 O8
OBS:     R  R  G  G  B  R  G  R
State:  S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

 P(A|B) = P(A) P(B|A) / P(B)

 P(A)   : prior
 P(B|A) : likelihood

 argmax_S P(S|O) = argmax_S P(S) P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

 P(S) = P(S1-8)
      = P(S1) P(S2|S1) P(S3|S1-2) P(S4|S1-3) … P(S8|S1-7)

By Markov assumption (k=1):

 P(S) = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) … P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

 P(O|S) = P(O1 | S1-8) P(O2 | O1, S1-8) P(O3 | O1-2, S1-8) … P(O8 | O1-7, S1-8)

Assumption: the ball drawn depends only on the urn chosen:

 P(O|S) = P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)

So:
 P(S|O) ∝ P(S) P(O|S)
        = P(S1) P(S2|S1) P(S3|S2) P(S4|S3) … P(S8|S7)
          x P(O1|S1) P(O2|S2) P(O3|S3) … P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S) P(O|S) = [P(O0|S0) P(S1|S0)] [P(O1|S1) P(S2|S1)] [P(O2|S2) P(S3|S2)] [P(O3|S3) P(S4|S3)] [P(O4|S4) P(S5|S4)] [P(O5|S5) P(S6|S5)] [P(O6|S6) P(S7|S6)] [P(O7|S7) P(S8|S7)] [P(O8|S8) P(S9|S8)]

 We introduce the states S0 and S9 as initial and final states respectively.
 After S8 the next state is S9 with probability 1, i.e. P(S9|S8) = 1.
 O0 is the ε-transition.

        O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:     ε  R  R  G  G  B  R  G  R
State:  S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99
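The grouped form above is exactly what one computes for a candidate state sequence. A small sketch of ours follows; the start state S0 is handled by treating P(O0 = ε | S0) = 1 and P(S1|S0) as the initial distribution, which (as noted earlier) is an assumed value rather than one given on these slides.

```python
import numpy as np

A  = np.array([[0.1, 0.4, 0.5], [0.6, 0.2, 0.2], [0.3, 0.4, 0.3]])   # P(Uj | Ui)
Bm = np.array([[0.3, 0.5, 0.2], [0.1, 0.4, 0.5], [0.6, 0.1, 0.3]])   # P(colour | Ui)
pi = np.array([0.5, 0.3, 0.2])                                       # assumed P(S1 | S0)
colour = {'R': 0, 'G': 1, 'B': 2}
urn = {'U1': 0, 'U2': 1, 'U3': 2}

def score(states, obs):
    """P(S) P(O|S) in the grouped form above, for one candidate S."""
    s = [urn[u] for u in states]
    o = [colour[c] for c in obs]
    p = 1.0 * pi[s[0]]                            # [P(O0=ε|S0) P(S1|S0)]
    for k in range(len(s) - 1):
        p *= Bm[s[k], o[k]] * A[s[k], s[k + 1]]   # [P(Ok|Sk) P(Sk+1|Sk)]
    p *= Bm[s[-1], o[-1]] * 1.0                   # [P(O8|S8) P(S9|S8) = 1]
    return p

print(score(['U1', 'U2', 'U3', 'U1', 'U2', 'U3', 'U1', 'U2'], 'RRGGBRGR'))
```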

Introducing useful notation

[Figure: the lattice S0 -> S1 -> S2 -> … -> S8 -> S9, with the observations ε, R, R, G, G, B, R, G, R emitted on successive arcs.]

        O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:     ε  R  R  G  G  B  R  G  R
State:  S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

 P(Ok|Sk) P(Sk+1|Sk) = P(Sk -> Sk+1)^Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

[Figure: a probabilistic FSM with two states S1 and S2; each directed arc carries a (symbol, probability) label: (a1, 0.3), (a2, 0.4), (a1, 0.2), (a2, 0.3), (a1, 0.1), (a2, 0.2), (a1, 0.3), (a2, 0.2).]

The question here is: "what is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree

[Figure: the Viterbi tree over ε, a1, a2, ….
 Start: S1 = 1.0, S2 = 0.0.
 Reading a1 (arc probabilities 0.1, 0.3 out of S1 and 0.2, 0.3 out of S2): 1.0 x 0.1 = 0.1 (S1), 1.0 x 0.3 = 0.3 (S2); the branches out of S2 stay 0.0.
 Reading a2 (arc probabilities 0.2, 0.4, 0.3, 0.2): 0.1 x 0.2 = 0.02, 0.1 x 0.4 = 0.04, 0.3 x 0.3 = 0.09, 0.3 x 0.2 = 0.06.]

Choose the winning sequence per state, per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure (contd.)

[Figure: continuing the tree from the step-2 winners S1 = 0.09 and S2 = 0.06.
 Reading a1: 0.09 x 0.1 = 0.009, 0.09 x 0.3 = 0.027, 0.06 x 0.2 = 0.012, 0.06 x 0.3 = 0.018.
 Reading a2: the winners expand to S1: 0.012 x 0.2 = 0.0024 and 0.027 x 0.3 = 0.0081; S2: 0.012 x 0.4 = 0.0048 and 0.027 x 0.2 = 0.0054.]

The problem being addressed by this tree is: S* = argmax_S P(S | a1-a2-a1-a2, μ),
where a1-a2-a1-a2 is the output sequence and μ is the model (or the machine).

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

 S1 -a1-> S2 -a2-> S1 -a1-> S2 -a2-> S1

Problem statement: find the best possible sequence
 S* = argmax_S P(S | O, μ)
where S = state sequence, O = output sequence, μ = machine (or model).

The machine (model) μ consists of:
 S  : state collection
 S0 : start symbol (start state)
 A  : alphabet set
 T  : transitions, where T is defined as P(Si -ak-> Sj)

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

          ε        a1                      a2              a1                a2
  S1     1.0   (1.0x0.1, 0.0x0.2)      (0.02, 0.09)    (0.009, 0.012)   (0.0024, 0.0081)
               = (0.1, 0.0)
  S2     0.0   (1.0x0.3, 0.0x0.3)      (0.04, 0.06)    (0.027, 0.018)   (0.0048, 0.0054)
               = (0.3, 0.0)

Rows: ending state; columns: latest symbol observed.

Note: every cell records the winning probability ending in that state; the two entries of a cell are the contributions coming from S1 and S2 respectively.

The bold-faced value in each cell is the sequence probability ending in that state. The final winner is 0.0081 (ending in S1, reached via S2, i.e. the 2nd element of the tuple); going backward from it we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105
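The cell values in the table can be regenerated mechanically. In the sketch below the arc probabilities are our transcription of the probabilistic FSM figure (not code from the lecture); each new cell keeps, per state, the contributions from S1 and from S2, exactly as in the table.

```python
# arc[(src, symbol, dst)] = probability, transcribed from the PFSM figure
arc = {('S1', 'a1', 'S1'): 0.1, ('S1', 'a1', 'S2'): 0.3,
       ('S1', 'a2', 'S1'): 0.2, ('S1', 'a2', 'S2'): 0.4,
       ('S2', 'a1', 'S1'): 0.2, ('S2', 'a1', 'S2'): 0.3,
       ('S2', 'a2', 'S1'): 0.3, ('S2', 'a2', 'S2'): 0.2}

best = {'S1': 1.0, 'S2': 0.0}                   # column for ε
for symbol in ['a1', 'a2', 'a1', 'a2']:
    cells = {dst: tuple(best[src] * arc[(src, symbol, dst)] for src in ('S1', 'S2'))
             for dst in ('S1', 'S2')}
    print(symbol, cells)                        # the two-entry cells of the table
    best = {dst: max(cells[dst]) for dst in ('S1', 'S2')}
print(best)                                     # ~{'S1': 0.0081, 'S2': 0.0054}
```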

Algorithm
(following James Allen, "Natural Language Understanding", 2nd edition, Benjamin/Cummings (pub.), 1995)

Given:
1. The HMM, which means:
   a. Start state S1
   b. Alphabet A = {a1, a2, … ap}
   c. Set of states S = {S1, S2, … Sn}
   d. Transition probabilities P(Si -ak-> Sj), which is equal to P(Sj, ak | Si)
2. The output string a1 a2 … aT

To find: the most likely sequence of states C1 C2 … CT which produces the given output sequence, i.e.
   C1 C2 … CT = argmax_C [ P(C | a1 a2 … aT) ]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm (contd.)

Data structures:
1. An N x T array called SEQSCORE, to maintain the winner sequence always (N = #states, T = length of the output sequence)
2. Another N x T array called BACKPTR, to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For i = 2 to N do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For t = 2 to T do
    For i = 1 to N do
        SEQSCORE(i,t) = Max over j = 1..N of [ SEQSCORE(j, t-1) x P(Sj -a_k-> Si) ]
        BACKPTR(i,t)  = the index j that gives the Max above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence identification

C(T) = the i that maximizes SEQSCORE(i,T)
For i from (T-1) down to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1 x T
2. SEQSCORE can be T x 2

Homework: compare this with A* and Beam Search.
Reason for this comparison: both of them work for finding and recovering a best sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 109
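Putting the three steps together, here is a small runnable sketch of the SEQSCORE/BACKPTR procedure (ours, not the book's code), exercised on the two-state probabilistic FSM from the earlier slides; it recovers the path S1 S2 S1 S2 S1 found by the tree and table method.

```python
def viterbi(output, states, start, arc):
    """arc[(Si, ak, Sj)] = P(Si -ak-> Sj).  Returns the best state sequence."""
    T, N = len(output), len(states)
    SEQSCORE = [[0.0] * (T + 1) for _ in range(N)]   # column 0 holds the start
    BACKPTR = [[0] * (T + 1) for _ in range(N)]
    SEQSCORE[states.index(start)][0] = 1.0           # initialization
    for t in range(1, T + 1):                        # iteration
        for i, Si in enumerate(states):
            scores = [SEQSCORE[j][t - 1] * arc.get((Sj, output[t - 1], Si), 0.0)
                      for j, Sj in enumerate(states)]
            SEQSCORE[i][t] = max(scores)
            BACKPTR[i][t] = scores.index(max(scores))
    c = max(range(N), key=lambda i: SEQSCORE[i][T])  # sequence identification
    path = [c]
    for t in range(T, 0, -1):
        c = BACKPTR[c][t]
        path.append(c)
    return [states[i] for i in reversed(path)]

arc = {('S1', 'a1', 'S1'): 0.1, ('S1', 'a1', 'S2'): 0.3,
       ('S1', 'a2', 'S1'): 0.2, ('S1', 'a2', 'S2'): 0.4,
       ('S2', 'a1', 'S1'): 0.2, ('S2', 'a1', 'S2'): 0.3,
       ('S2', 'a2', 'S1'): 0.3, ('S2', 'a2', 'S2'): 0.2}
print(viterbi(['a1', 'a2', 'a1', 'a2'], ['S1', 'S2'], 'S1', arc))
# ['S1', 'S2', 'S1', 'S2', 'S1']
```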

Viterbi Algorithm for the Urn problem (first two symbols)

[Figure: the first two levels of the Viterbi tree for the urn problem.
 From S0, on ε, arcs go to U1, U2, U3 with probabilities 0.5, 0.3, 0.2.
 On the first observation R, each Ui expands again to U1, U2, U3; the arc values shown include 0.03, 0.08, 0.15, 0.06, 0.02, 0.02, 0.18, 0.24, 0.18, and the accumulated node scores shown are 0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036.]

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of order > 1 (say 2)

The same theory works:
P(S) P(O|S) = [P(O0|S0) P(S1|S0)] [P(O1|S1) P(S2|S1,S0)] [P(O2|S2) P(S3|S2,S1)] [P(O3|S3) P(S4|S3,S2)] [P(O4|S4) P(S5|S4,S3)] [P(O5|S5) P(S6|S5,S4)] [P(O6|S6) P(S7|S6,S5)] [P(O7|S7) P(S8|S7,S6)] [P(O8|S8) P(S9|S8,S7)]

 We introduce the states S0 and S9 as initial and final states respectively.
 After S8 the next state is S9 with probability 1, i.e. P(S9|S8,S7) = 1.
 O0 is the ε-transition.

        O0 O1 O2 O3 O4 O5 O6 O7 O8
Obs:     ε  R  R  G  G  B  R  G  R
State:  S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why the probability of an observation sequence? The language modeling problem.

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   • Less probable because of a grammatical mistake.
3. P("The svn rises in the east")
   • Less probable because of a lexical mistake.
4. P("The sun rises in the west")
   • Less probable because of a semantic mistake.

Probabilities are computed in the context of corpora.

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of a language model:
1. Detect well-formedness
   • Lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   • Given a piece of text, what language does it belong to?
     Good morning  -> English
     Guten Morgen  -> German
     Bonjour       -> French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0 o1 o2 o3 … om)?

 P(O) = Σ_S P(O, S)      [marginalization]

Consider the observation sequence
 O = o0 o1 o2 … om
 S = S0 S1 S2 … Sm Sm+1
where the Si represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0 o1 o2 o3 … om)

 P(O, S) = P(S) P(O|S)

 P(S) = P(S0 S1 S2 … Sm+1)
      = P(S0) P(S1|S0) P(S2|S1) … P(Sm+1|Sm)

 P(O|S) = P(o0 o1 o2 … om | S0 S1 … Sm+1)
        = P(o0|S0) P(o1|S1) … P(om|Sm)

 P(O, S) = [P(o0|S0) P(S1|S0)] [P(o1|S1) P(S2|S1)] … [P(om|Sm) P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116
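A direct, deliberately inefficient rendering of this marginalization simply enumerates every state sequence of the urn model; the forward algorithm of the next section computes the same number without the exponential enumeration. The sketch is ours, and as before the initial distribution is an assumed value.

```python
from itertools import product
import numpy as np

A = np.array([[0.1, 0.4, 0.5], [0.6, 0.2, 0.2], [0.3, 0.4, 0.3]])
B = np.array([[0.3, 0.5, 0.2], [0.1, 0.4, 0.5], [0.6, 0.1, 0.3]])
pi = np.array([0.5, 0.3, 0.2])                    # assumption, as noted earlier
obs = [0, 0, 1, 1, 2, 0, 1, 0]                    # R R G G B R G R

total = 0.0
for seq in product(range(3), repeat=len(obs)):    # 3^8 = 6561 state sequences
    p = pi[seq[0]] * B[seq[0], obs[0]]
    for k in range(1, len(obs)):
        p *= A[seq[k - 1], seq[k]] * B[seq[k], obs[k]]
    total += p
print(total)                                      # P(O) by marginalization
```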

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k, i)

Define F(k, i) = probability of being in state Si having seen o0 o1 o2 … ok,
i.e. F(k, i) = P(o0 o1 o2 … ok, Si).
With m as the length of the observed sequence and N states:

P(observed sequence) = P(o0 o1 o2 … om)
                     = Σp=0..N P(o0 o1 o2 … om, Sp)
                     = Σp=0..N F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

HMM

Algorithm

Problem

LanguageHindi

Marathi

English

FrenchMorphAnalysis

Part of SpeechTagging

Parsing

Semantics

CRF

HMM

MEMM

NLPTrinity

21 July 2014Pushpak Bhattacharyya Intro

POS 71

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

A Motivating Example

Urn 1 of Red = 30

of Green = 50 of Blue = 20

Urn 3 of Red =60

of Green =10 of Blue = 30

Urn 2 of Red = 10

of Green = 40 of Blue = 50

Colored Ball choosing

21 July 2014Pushpak Bhattacharyya Intro

POS 72

Example (contd)

U1 U2 U3

U1 01 04 05U2 06 02 02U3 03 04 03

Given

Observation RRGGBRGR

State Sequence

Not so Easily Computable

and

R G BU1 03 05 02U2 01 04 05U3 06 01 03

21 July 2014Pushpak Bhattacharyya Intro

POS 73

Emission probability tableTransition probability table

Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131


Diagrammatic representation (12)

U1

U2

U3

01

02

04

06

04

05

03

02

03

R 06

G 01

B 03

R 01

B 05

G 04

B 02

R 03 G 05

21 July 2014Pushpak Bhattacharyya Intro

POS 74

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Diagrammatic representation (22)

U1

U2

U3

R002G008B010

R024G004B012

R006G024B030R 008

G 020B 012

R015G025B010

R018G003B009

R018G003B009

R002G008B010

R003G005B002

21 July 2014Pushpak Bhattacharyya Intro

POS 75

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Classic problems with respect to HMM1Given the observation sequence find the

possible state sequences- Viterbi2Given the observation sequence find its

probability- forwardbackward algorithm3Given the observation sequence find the

HMM prameters- Baum-Welch algorithm

21 July 2014Pushpak Bhattacharyya Intro

POS 76

Illustration of Viterbi The ldquostartrdquo and ldquoendrdquo are important in a

sequence Subtrees get eliminated due to the Markov

Assumption

POS Tagset N(noun) V(verb) O(other) [simplified] ^ (start) (end) [start amp end states]

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd…

[Figure: the tree developed one more level. From the winning nodes S1 (0.09) and S2 (0.06), expansion on a1 gives scores 0.09·0.1 = 0.009, 0.027, 0.018, 0.012 with arc probabilities 0.1, 0.3, 0.2, 0.3; the per-state winners expand on a2 with arc probabilities 0.3, 0.2, 0.4, 0.2, giving 0.0081, 0.0054, 0.0048, 0.0024, ending in S1, S2, S2, S1 respectively.]

The problem being addressed by this tree is S* = argmax_S P(S | a1-a2-a1-a2, µ),
where a1-a2-a1-a2 is the output sequence and µ the model (the machine).

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

[Figure: the recovered state path S1, S2, S1, S2, S1, read off by working backward through the per-state winners over the output sequence.]

Problem statement: find the best possible sequence S* = argmax_S P(S | O, µ),
where S: state sequence, O: output sequence, µ: machine or model.

µ = (A, S, T, S0): Alphabet set, State collection, Transitions, Start symbol.

T is defined as the set of arc probabilities P(Si --ak--> Sj) for all i, j, k.

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

Latest symbol observed →
Ending state ↓   ε      a1                              a2              a1                a2
 S1             1.0    (1.0·0.1, 0.0·0.2) = (0.1, 0.0)   (0.02, 0.09)    (0.009, 0.012)    (0.0024, 0.0081)
 S2             0.0    (1.0·0.3, 0.0·0.3) = (0.3, 0.0)   (0.04, 0.06)    (0.027, 0.018)    (0.0048, 0.0054)

Note: every cell records the winning probability ending in that state.
The bold-faced value in each cell is the winning sequence probability ending in that state. Going backward from the final winner sequence, which ends in state S2 (indicated by the 2nd tuple), we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm (following James Allen, Natural Language Understanding (2nd edition), Benjamin Cummings (pub.), 1995)

Given:
1. The HMM, which means:
   a. Start State: S1
   b. Alphabet: A = {a1, a2, …, ap}
   c. Set of States: S = {S1, S2, …, Sn}
   d. Transition probability P(Si --ak--> Sj), which is equal to P(Sj, ak | Si)
2. The output string: a1 a2 … aT

To find: the most likely sequence of states C1 C2 … CT which produces the given output sequence, i.e.

C1 C2 … CT = argmax_C [P(C | a1 a2 … aT)]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contd…

Data structures:
1. An N×T array called SEQSCORE to maintain the winner sequence always (N = #states, T = length of output sequence)
2. Another N×T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For i = 2 to N do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For t = 2 to T do
    For i = 1 to N do
        SEQSCORE(i,t) = Max over j = 1..N of [SEQSCORE(j, t-1) · P(Sj --ak--> Si)]
        BACKPTR(i,t) = the index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = i that maximizes SEQSCORE(i,T)
For i from (T-1) down to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1×T
2. SEQSCORE can be T×2

Homework: compare this with A* and beam search. Reason for this comparison: both of them work for finding and recovering a sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 109
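Putting the three steps together, a minimal runnable sketch of this algorithm is given below (plain Python; seqscore and backptr mirror SEQSCORE and BACKPTR, and the dict-based states/pi/A/B follow the earlier urn snippet). One assumption of the sketch: instead of fixing S1 as the start state, the initial scores are taken from pi.

```python
def viterbi(obs, states, pi, A, B):
    """Most likely state sequence: Initialization, Iteration, Sequence Identification."""
    T = len(obs)
    # Initialization: best score ending in each state after the first symbol.
    seqscore = [{s: pi[s] * B[s][obs[0]] for s in states}]
    backptr = [{s: None for s in states}]
    # Iteration: extend the best path per state, remembering where it came from.
    for t in range(1, T):
        seqscore.append({})
        backptr.append({})
        for s in states:
            prev, score = max(
                ((p, seqscore[t - 1][p] * A[p][s] * B[s][obs[t]]) for p in states),
                key=lambda x: x[1],
            )
            seqscore[t][s] = score
            backptr[t][s] = prev
    # Sequence identification: start from the best final state, follow back-pointers.
    last = max(states, key=lambda s: seqscore[T - 1][s])
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    return list(reversed(path)), seqscore[T - 1][last]
```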

Viterbi Algorithm for the Urn problem (first two symbols)

[Figure: the Viterbi tree for the first two symbols ε and R. From S0 the arcs to U1, U2, U3 carry probabilities 0.5, 0.3, 0.2; the expanded nodes show accumulated scores 0.03, 0.08, 0.15, then 0.06, 0.02, 0.02, 0.18, 0.24, 0.18 at the next expansion, with leaf scores 0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036.]

21 July 2014 Pushpak Bhattacharyya Intro POS

110
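Under the assumptions of the two earlier sketches (uniform pi, and the illustrative A, B tables of the urn example), the best urn sequence for the full observation string could be obtained with:

```python
obs = ["R", "R", "G", "G", "B", "R", "G", "R"]
path, score = viterbi(obs, states, pi, A, B)
print(path, score)   # a sequence of U1/U2/U3 labels and its probability
```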

Markov process of order > 1 (say 2)

Same theory works:

P(S) · P(O|S) = [P(O0|S0) P(S1|S0)] · [P(O1|S1) P(S2|S1 S0)] · [P(O2|S2) P(S3|S2 S1)] ·
                [P(O3|S3) P(S4|S3 S2)] · [P(O4|S4) P(S5|S4 S3)] · [P(O5|S5) P(S6|S5 S4)] ·
                [P(O6|S6) P(S7|S6 S5)] · [P(O7|S7) P(S8|S7 S6)] · [P(O8|S8) P(S9|S8 S7)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e. P(S9|S8 S7) = 1.
O0 is the ε-transition.

Obs:    ε   R   R   G   G   B   R   G   R
        O0  O1  O2  O3  O4  O5  O6  O7  O8
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence? The language modeling problem.

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   • Less probable because of grammatical mistake
3. P("The svn rises in the east")
   • Less probable because of lexical mistake
4. P("The sun rises in the west")
   • Less probable because of semantic mistake

Probabilities are computed in the context of corpora.

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of a language model

1. Detect well-formedness
   • Lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   • Given a piece of text, what language does it belong to?
     Good morning  - English
     Guten Morgen  - German
     Bonjour       - French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0 o1 o2 o3 … om)?

P(O) = Σ_S P(O, S)     [Marginalization]

Consider the observation sequence:

  O0  O1  O2  …  Om
  S0  S1  S2  S3 … Sm  Sm+1

where the Si's represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0 o1 o2 o3 … om)

P(O, S) = P(S) · P(O|S)

P(S0 S1 … Sm+1) · P(o0 o1 … om | S0 S1 … Sm+1)
  = [P(S0) · P(S1|S0) · P(S2|S1) · … · P(Sm+1|Sm)] × [P(o0|S0) · P(o1|S1) · … · P(om|Sm)]
  = P(S0) · [P(o0|S0) P(S1|S0)] · [P(o1|S1) P(S2|S1)] · … · [P(om|Sm) P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k, i)

Define F(k, i) = probability of being in state Si having seen o0 o1 o2 … ok:

F(k, i) = P(o0 o1 o2 … ok, Si)

With m as the length of the observed sequence and N states,

P(observed sequence) = P(o0 o1 o2 … om)
                     = Σ_{p=0..N} P(o0 o1 o2 … om, Sp)
                     = Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd.)

F(k, q) = P(o0 o1 o2 … ok, Sq)
        = P(o0 o1 o2 … ok-1, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 … ok-1, Sp, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 … ok-1, Sp) · P(ok, Sq | o0 o1 o2 … ok-1, Sp)
        = Σ_{p=0..N} F(k-1, p) · P(ok, Sq | Sp)
        = Σ_{p=0..N} F(k-1, p) · P(Sp --ok--> Sq)

  O0  O1  O2  O3  …  Ok  Ok+1  …  Om-1  Om
  S0  S1  S2  S3  …  Sp  Sq    …  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119
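Turning the recurrence F(k, q) = Σ_p F(k-1, p) · P(Sp --ok--> Sq) into code is direct. A minimal sketch, again assuming the dict-based states/pi/A/B of the earlier urn snippet (illustrative names, not from the slides), with the emission attached to the destination state as in the definition of F:

```python
def forward(obs, states, pi, A, B):
    """P(O): sum over all state sequences, computed with the forward recurrence."""
    # F[s] = probability of the observations seen so far, ending in state s.
    F = {s: pi[s] * B[s][obs[0]] for s in states}
    for o in obs[1:]:
        F = {q: sum(F[p] * A[p][q] * B[q][o] for p in states) for q in states}
    return sum(F.values())   # marginalize over the final state

print(forward(["R", "R", "G"], states, pi, A, B))
```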

Backward probability B(k, i)

Define B(k, i) = probability of seeing ok ok+1 ok+2 … om given that the state was Si:

B(k, i) = P(ok ok+1 ok+2 … om | Si)

With m as the length of the whole observed sequence,

P(observed sequence) = P(o0 o1 o2 … om) = P(o0 o1 o2 … om | S0) = B(0, 0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd.)

B(k, p) = P(ok ok+1 ok+2 … om | Sp)
        = P(ok+1 ok+2 … om, ok | Sp)
        = Σ_{q=0..N} P(ok+1 ok+2 … om, ok, Sq | Sp)
        = Σ_{q=0..N} P(ok, Sq | Sp) · P(ok+1 ok+2 … om | ok, Sq, Sp)
        = Σ_{q=0..N} P(ok+1 ok+2 … om | Sq) · P(ok, Sq | Sp)
        = Σ_{q=0..N} B(k+1, q) · P(Sp --ok--> Sq)

  O0  O1  O2  O3  …  Ok  Ok+1  …  Om-1  Om
  S0  S1  S2  S3  …  Sp  Sq    …  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121
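A matching sketch for the backward recurrence, with the same illustrative parameter names. One assumption: following this slide's convention that B(k, i) includes the emission of ok, the emission is taken from the state at position k, and P(O) is recovered by marginalizing B(0, ·) over pi; it returns the same value as the forward sketch.

```python
def backward(obs, states, pi, A, B):
    """P(O) via the backward recurrence; Bk[s] = P(o_k ... o_m | state at k is s)."""
    Bk = {s: B[s][obs[-1]] for s in states}          # base case: only the last symbol left
    for o in reversed(obs[:-1]):
        Bk = {p: B[p][o] * sum(A[p][q] * Bk[q] for q in states) for p in states}
    return sum(pi[s] * Bk[s] for s in states)        # marginalize over the initial state

# Forward and backward agree on P(O):
print(backward(["R", "R", "G"], states, pi, A, B))
```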

How Forward Probability Works

Goal of Forward Probability: to find P(O), the probability of the Observation Sequence.

E.g. ^ People laugh

[Figure: trellis with the start node ^, two columns of N/V nodes (one per word), and the end node.]

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probabilities P(next tag | tag):
        ^     N     V     $
 ^      0    0.7   0.3    0
 N      0    0.2   0.6   0.2
 V      0    0.6   0.2   0.2
 $      1     0     0     0

Lexical probabilities P(word | tag):
        ε    People  Laugh
 ^      1      0       0
 N      0     0.8     0.2
 V      0     0.1     0.9
 $      1      0       0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all state sequences S of Π_i [P(Oi|Si) · P(Si → Si+1)]

(i.e. sum, over every path through the trellis, of the product of the lexical and transition probabilities along that path)

Computation in the various paths of the tree:

            ε     People   Laugh
Path 1:     ^       N        N       P(Path1) = (1.0×0.7) × (0.8×0.2) × (0.2×0.2)
Path 2:     ^       N        V       P(Path2) = (1.0×0.7) × (0.8×0.6) × (0.9×0.2)
Path 3:     ^       V        N       P(Path3) = (1.0×0.3) × (0.1×0.6) × (0.2×0.2)
Path 4:     ^       V        V       P(Path4) = (1.0×0.3) × (0.1×0.2) × (0.9×0.2)

[Figure: the corresponding tree rooted at ^, branching to N and V on "People" and again to N and V on "Laugh".]

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis

F = accumulated F × output probability × transition probability

F_N1 = 0.7 × 1.0
F_V1 = 0.3 × 1.0
F_N2 = F_N1 × (0.2 × 0.8) + F_V1 × (0.6 × 0.1)
F_V2 = F_N1 × (0.6 × 0.8) + F_V1 × (0.2 × 0.1)
F_$  = F_N2 × (0.2 × 0.2) + F_V2 × (0.2 × 0.9)

[Figure: the trellis ^ → {N, V} → {N, V} → end over ε, People, Laugh, with these F values attached to the nodes.]

21 July 2014Pushpak Bhattacharyya Intro

POS 125
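The five F values above can be verified numerically. A small sketch (the table values are transcribed from this slide; the ε step at ^ contributes the factor 1.0, and the variable names are illustrative):

```python
# Illustrative check of the trellis values above.
trans = {"^": {"N": 0.7, "V": 0.3},
         "N": {"N": 0.2, "V": 0.6, "$": 0.2},
         "V": {"N": 0.6, "V": 0.2, "$": 0.2}}
lex = {"N": {"People": 0.8, "Laugh": 0.2},
       "V": {"People": 0.1, "Laugh": 0.9}}

F_N1 = 1.0 * trans["^"]["N"]                       # 0.7
F_V1 = 1.0 * trans["^"]["V"]                       # 0.3
F_N2 = F_N1 * trans["N"]["N"] * lex["N"]["People"] + F_V1 * trans["V"]["N"] * lex["V"]["People"]
F_V2 = F_N1 * trans["N"]["V"] * lex["N"]["People"] + F_V1 * trans["V"]["V"] * lex["V"]["People"]
F_end = F_N2 * trans["N"]["$"] * lex["N"]["Laugh"] + F_V2 * trans["V"]["$"] * lex["V"]["Laugh"]

# F_end equals the sum of the four path probabilities computed on the previous slide.
print(F_end)
```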

Number of Multiplications

Tree:
  Each path has 5 multiplications.
  There are 4 paths in the tree.
  Therefore a total of 20 multiplications and 3 additions (to sum the 4 path probabilities).

Trellis:
  F_N1 → 1 multiplication
  F_V1 → 1 multiplication
  F_N2 = F_N1 × (1 mult) + F_V1 × (1 mult) = 4 multiplications + 1 addition
  Similarly for F_V2 and F_$: 4 multiplications and 1 addition each.
  So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = #states and |O| = observation length (excluding ^ and the end marker).

  Stage 1 of the trellis: |S| multiplications.
  Each later stage: |S| nodes, and each node needs computation over |S| arcs; each arc = 1 multiplication, plus 1 more multiplication for the accumulated F. Total: 2|S|² multiplications per stage.
  The same holds for each stage before reading the end marker; the final stage (the end marker) needs 2|S| multiplications.

Therefore total multiplications = |S| + 2|S|²(|O| − 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values multiplied by P(Sp → Sq).
3. Complexity = |S| + 2|S|²(|O| − 1) + 2|S|
             = 2|S|²|O| − 2|S|² + 3|S|
             = O(|S|² |O|)

i.e. linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward Probability
   a) Derive the Backward Algorithm
   b) Compute its complexity
2. Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List
 TnT (http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf)
 Brill Tagger (http://delivery.acm.org/10.1145/1080000/1075553/p112-brill.pdf)
 Hindi POS Tagger built by IIT Bombay (http://www.cse.iitb.ac.in/~pb/papers/ACL-2006-Hindi-POS-Tagging.pdf)
 Projection (http://www.dipanjandas.com/files/posInduction.pdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Illustration of Viterbi

 The "start" and "end" are important in a sequence.
 Subtrees get eliminated due to the Markov Assumption.

POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), $ (end) [start & end states]

Lexicon:
  people: N, V
  laugh:  N, V

Corpora for Training:
  ^ w11_t11 w12_t12 w13_t13 …… w1k_1_t1k_1
  ^ w21_t21 w22_t22 w23_t23 …… w2k_2_t2k_2
  ^ wn1_tn1 wn2_tn2 wn3_tn3 …… wnk_n_tnk_n

Inference

[Figure: partial sequence graph — ^ expands on Є to N and V for "people", and each of those expands again to N and V for "laugh".]

Transition table:
        ^     N     V     O     $
 ^      0    0.6   0.2   0.2    0
 N      0    0.1   0.4   0.3   0.2
 V      0    0.3   0.1   0.3   0.3
 O      0    0.3   0.2   0.3   0.2
 $      1     0     0     0     0

This transition table will change from language to language due to language divergences.

Lexical Probability Table:
        Є     people    laugh     …
 ^      1       0         0       0
 N      0     1×10⁻³    1×10⁻⁵
 V      0     1×10⁻⁶    1×10⁻³
 O      0       0       1×10⁻⁹
 $      1       0         0       0

Size of this table = #POS tags in the tagset × vocabulary size,
where vocabulary size = #unique words in the corpus.

Inference on a new sentence: ^ people laugh

p(^ N N | ^ people laugh) = (0.6 × 1.0) × (0.1 × 1×10⁻³) × (0.2 × 1×10⁻⁵)

Computational Complexity

If we have to compute the probability of each sequence and then find the maximum among them, we run into an exponential number of computations.

If |s| = #states (tags + ^ + end marker) and |o| = length of the sentence (#words + ^ + end marker), then #sequences = |s|^(|o|−2).

But a large number of partial computations can be reused using Dynamic Programming; the small sketch below illustrates the difference in scale.
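A tiny illustrative calculation (the tagset size of 5 and the 10-word sentence are assumptions, not from the slides):

```python
# Brute force vs dynamic programming, for a 5-symbol state set and a 10-word sentence.
num_states = 5              # assumed: N, V, O plus the start and end markers
sentence_len = 12           # 10 words plus ^ and the end marker
brute_force = num_states ** (sentence_len - 2)   # one score per complete tag sequence
dynamic_prog = num_states * sentence_len          # nodes retained in the trellis
print(brute_force, dynamic_prog)                  # ~9.8 million vs 60
```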

Dynamic Programming

[Figure: dynamic-programming trellis for "^ people laugh". The ^ node expands on Є with score 0.6 × 1.0 = 0.6 for N; expanding further over "people"/"laugh" to the N, V, O nodes gives partial scores 0.6 × 0.1 × 10⁻³ = 6 × 10⁻⁵, 0.6 × 0.4 × 10⁻³ = 2.4 × 10⁻⁴, 0.6 × 0.3 × 10⁻³ = 1.8 × 10⁻⁴ and 0.6 × 0.2 × 10⁻³ = 1.2 × 10⁻⁴.]

No need to expand N4 and N5, because they will never be part of the winning sequence.

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.

The complexity now reduces from |s|^|o| to |s|·|o|.

Here we followed the Markov assumption of order 1.

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state.
Keep advancing sequences that are "maximum" amongst all those ending in the same state.

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm

[Figure: tree for the sentence "^ People laugh". From ^ (on Ԑ), the children N, V, O carry scores (0.6), (0.2), (0.2); their expansions on "People" carry scores (0.06·10⁻³), (0.24·10⁻³), (0.18·10⁻³), (0.06·10⁻⁶), (0.02·10⁻⁶), (0.06·10⁻⁶), (0), (0), (0).]

Claim: we do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass

 Will a word always be given the same tag? No. Consider the example:

  ^ people the city with soldiers (i.e. 'populate')
  ^ quickly people the city

 In the first sentence "people" is most likely to be tagged as a noun, whereas in the second the probability mass will shift and "people" will be tagged as a verb, since it occurs after an adverb.

Tail phenomenon and Language phenomenon

 Long tail phenomenon: the probability is very low, but not zero, over a large observed sequence.

 Language phenomenon:
  "people", which is predominantly tagged as "Noun", displays a long tail phenomenon;
  "laugh" is predominantly tagged as "Verb".

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

[Figure: two nodes N1 (score 6·10⁻⁵) and N2 (score 6·10⁻⁸), each expanded into N, V, O children on the word LAUGH.]

Next step: all the probabilities will be multiplied by identical factors (lexical and transition), so the children of N2 will have lower probability than the children of N1.

21 July 2014Pushpak Bhattacharyya Intro

POS 90

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Illustration of ViterbiLexicon

people N Vlaugh N V

Corpora for Training^ w11_t11 w12_t12 w13_t13 helliphelliphelliphelliphelliphellipw1k_1_t1k_1 ^ w21_t21 w22_t22 w23_t23 helliphelliphelliphelliphelliphellipw2k_2_t2k_2 ^ wn1_tn1 wn2_tn2 wn3_tn3 helliphelliphelliphelliphelliphellipwnk_n_tnk_n

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Inference

^

NN

NV

^ N V O

^ 0 06 02 02 0

N 0 01 04 03 02

V 0 03 01 03 03

O 0 03 02 03 02

1 0 0 0 0

This transition table will change from language to language due to language divergences

Partial sequence graph

Lexical Probability Table

Size of this table = pos tags in tagset X vocabulary size

vocabulary size = unique words in corpus

Є people laugh hellip

^ 1 0 0 0

N 0 1x10-3 1x10-5

V 0 1x10-6 1x10-3

O 0 0 1x10-9

1 0 0 0 0

InferenceNew Sentence

^ people laugh

p( ^ N N | ^ people laugh )= (06 x 01) x (01 x 1 x 10-3) x (02 x 1 x 10-5)

^

NN

NV

Є

Є

Computational Complexity

If we have to get the probability of each sequence and then find maximum among them we would run into exponential number of computations

If |s| = states (tags + ^ + )and |o| = length of sentence ( words + ^ + )Then sequences = s|o|-2

But a large number of partial computations can be reused using Dynamic Programming

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values multiplied by P(S_p → S_q)_{o}.
3. Complexity = |S| + 2|S|²(|O| − 1) + 2|S|
              = 2|S|²|O| − 2|S|² + 3|S|
              = O(|S|² · |O|)

i.e., linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward probability:
   a) Derive the backward algorithm.
   b) Compute its complexity.
2. Express P(O) in terms of both forward and backward probabilities.

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble: auto-completion of words (human vs. machine)

Humour detection using WordNet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List

TnT (httpwwwaclweborganthology-newAA00A00-1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Lexical Probability Table

Size of this table = (number of POS tags in the tagset) × (vocabulary size)
Vocabulary size = number of unique words in the corpus

      Є       people        laugh        …
^     1         0             0           0
N     0      1 × 10^-3     1 × 10^-5
V     0      1 × 10^-6     1 × 10^-3
O     0         0          1 × 10^-9
.     1         0             0           0

Inference

New sentence: ^ people laugh

p(^ N N | ^ people laugh) = (0.6 × 0.1) × (0.1 × 1 × 10^-3) × (0.2 × 1 × 10^-5)

[Figure: partial tag tree for the sentence, ^ branching on Є to N and V, each branching again to N.]

Computational Complexity

If we had to compute the probability of each sequence separately and then find the maximum among them, we would run into an exponential number of computations.

If |s| = number of states (tags + ^ + .) and |o| = length of the sentence (words + ^ + .), then the number of sequences = |s|^(|o|−2).

But a large number of partial computations can be reused using dynamic programming.
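A rough sense of the gap, as a sketch (the tag-set and sentence sizes below are made up purely for illustration):

```python
# Made-up sizes, just to contrast |s|^(|o|-2) with the roughly O(|s|^2 |o|) trellis cost.
for s, o in [(5, 10), (40, 25)]:
    exhaustive = s ** (o - 2)      # score every tag sequence separately
    trellis    = 2 * s * s * o     # roughly one pass over the trellis
    print(f"|s|={s}, |o|={o}: {exhaustive:.3g} sequences vs ~{trellis} multiplications")
```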

Dynamic Programming

[Figure: tree for "Є people laugh" rooted at ^, expanding on Є to first-level nodes N1, V2, O3, each of which expands again to N, V, O children (N4, N5, …) on "people" and "laugh".]

Scores accumulated down the tree:
^ → N1 : 0.6 × 1.0 = 0.6
then, for its children:
0.6 × 0.1 × 10^-3 = 6 × 10^-5
1. 0.6 × 0.4 × 10^-3 = 2.4 × 10^-4
2. 0.6 × 0.3 × 10^-3 = 1.8 × 10^-4
3. 0.6 × 0.2 × 10^-3 = 1.2 × 10^-4

No need to expand N4 and N5 because they will never be part of the winning sequence.

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.

Now the complexity reduces from |s|^|o| to |s| · |o|.

Here we followed the Markov assumption of order 1.

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state. Keep advancing sequences that are "maximum" amongst all those ending in the same state.

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm

[Figure: tree for the sentence "^ People laugh". ^ expands on Ԑ to N, V, O with scores (0.6), (0.2), (0.2); on "People" each of these expands to N, V, O with scores (0.06 × 10^-3), (0.24 × 10^-3), (0.18 × 10^-3), (0.06 × 10^-6), (0.02 × 10^-6), (0.06 × 10^-6), (0), (0), (0).]

Claim: We do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass

Will a word always be given the same tag? No. Consider the examples:

^ people the city with soldiers (i.e., 'populate')
^ quickly people the city

In the first sentence "people" is most likely to be tagged as a noun, whereas in the second the probability mass will shift and "people" will be tagged as a verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon

Long tail phenomenon: the probability is very low but not zero over a large observed sequence.

Language phenomenon: "people", which is predominantly tagged as "Noun", displays a long tail phenomenon; "laugh" is predominantly tagged as "Verb".

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

[Figure: two nodes N1 (score 6 × 10^-5) and N2 (score 6 × 10^-8), each expanding to N, V, O on "LAUGH".]

In the next step all the probabilities will be multiplied by identical probabilities (lexical and transition). So the children of N2 will have probability less than the children of N1.

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean?

P(A|B) = P(B|A) if P(A) = P(B)

P(A|B) can mean:
Causality: B causes A
Sequentiality: A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = {U1, U2, U3}, V = {R, G, B}

For observation O = o1 … on and state sequence Q = q1 … qn,

π_i = P(q1 = U_i)

A (transition probabilities) =
      U1    U2    U3
U1   0.1   0.4   0.5
U2   0.6   0.2   0.2
U3   0.3   0.4   0.3

B (observation probabilities) =
      R     G     B
U1   0.3   0.5   0.2
U2   0.1   0.4   0.5
U3   0.6   0.1   0.3

92
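A small sketch of this model as data, plus the generative process it describes (draw the next urn from the transition row of the current urn, then draw a ball from that urn's colour distribution). The slide defines π but gives no numbers for it here, so the uniform value below is only a placeholder.

```python
import random

# Urn HMM from the slide; pi is a placeholder (pi_i = P(q1 = U_i) has no numbers here).
A  = {'U1': {'U1': 0.1, 'U2': 0.4, 'U3': 0.5},
      'U2': {'U1': 0.6, 'U2': 0.2, 'U3': 0.2},
      'U3': {'U1': 0.3, 'U2': 0.4, 'U3': 0.3}}
B  = {'U1': {'R': 0.3, 'G': 0.5, 'B': 0.2},
      'U2': {'R': 0.1, 'G': 0.4, 'B': 0.5},
      'U3': {'R': 0.6, 'G': 0.1, 'B': 0.3}}
pi = {'U1': 1/3, 'U2': 1/3, 'U3': 1/3}          # placeholder only

def draw(dist):
    return random.choices(list(dist), weights=dist.values())[0]

def generate(n):
    """Sample an (urn sequence, ball sequence) pair of length n from the HMM."""
    q, urns, balls = draw(pi), [], []
    for _ in range(n):
        urns.append(q)
        balls.append(draw(B[q]))                 # ball colour depends only on the urn
        q = draw(A[q])                           # next urn depends only on the current urn
    return urns, balls

print(generate(8))                               # e.g. (['U3', 'U1', ...], ['R', 'G', ...])
```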

Observations and states

        O1  O2  O3  O4  O5  O6  O7  O8
OBS:    R   R   G   G   B   R   G   R
State:  S1  S2  S3  S4  S5  S6  S7  S8

S_i = U1, U2, or U3: a particular state
S: state sequence
O: observation sequence
S* = "best" possible state (urn) sequence
Goal: Maximize P(S|O) by choosing the "best" S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O), where S is the state sequence and O is the observation sequence.

S* = argmax_S P(S|O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

P(S|O) = P(S1-8 | O1-8)
       = P(S1|O) · P(S2|S1, O) · P(S3|S1-2, O) · … · P(S8|S1-7, O)

By the Markov assumption (a state depends only on the previous state):

P(S|O) = P(S1|O) · P(S2|S1, O) · P(S3|S2, O) · … · P(S8|S7, O)

        O1  O2  O3  O4  O5  O6  O7  O8
OBS:    R   R   G   G   B   R   G   R
State:  S1  S2  S3  S4  S5  S6  S7  S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) = P(A) · P(B|A) / P(B)

P(A): Prior
P(B|A): Likelihood

argmax_S P(S|O) = argmax_S P(S) · P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transition Probability

P(S) = P(S1-8)
     = P(S1) · P(S2|S1) · P(S3|S1-2) · P(S4|S1-3) · … · P(S8|S1-7)

By the Markov assumption (k=1):

P(S) = P(S1) · P(S2|S1) · P(S3|S2) · P(S4|S3) · … · P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence Probability

P(O|S) = P(O1|S1-8) · P(O2|O1, S1-8) · P(O3|O1-2, S1-8) · … · P(O8|O1-7, S1-8)

Assumption: the ball drawn depends only on the urn chosen, so

P(O|S) = P(O1|S1) · P(O2|S2) · P(O3|S3) · … · P(O8|S8)

Therefore,

P(S|O) ∝ P(S) · P(O|S)
       = P(S1) · P(S2|S1) · P(S3|S2) · P(S4|S3) · … · P(S8|S7) · P(O1|S1) · P(O2|S2) · P(O3|S3) · … · P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S) · P(O|S) = [P(O0|S0) · P(S1|S0)]
                · [P(O1|S1) · P(S2|S1)] · [P(O2|S2) · P(S3|S2)]
                · [P(O3|S3) · P(S4|S3)] · [P(O4|S4) · P(S5|S4)]
                · [P(O5|S5) · P(S6|S5)] · [P(O6|S6) · P(S7|S6)]
                · [P(O7|S7) · P(S8|S7)] · [P(O8|S8) · P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e., P(S9|S8) = 1.
O0 is the ε-transition.

Obs:    ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

[Figure: the states S0 → S1 → S2 → … → S8 → S9 in a chain, with the arcs labelled by the observations ε, R, R, G, G, B, R, G, R.]

Obs:    ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

P(O_k|S_k) · P(S_{k+1}|S_k) = P(S_k → S_{k+1})_{O_k}

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

[Figure: two states S1 and S2 with arcs labelled (symbol, probability). From S1: (a1, 0.1) to S1, (a1, 0.3) to S2, (a2, 0.2) to S1, (a2, 0.4) to S2. From S2: (a1, 0.2) to S1, (a1, 0.3) to S2, (a2, 0.3) to S1, (a2, 0.2) to S2.]

The question here is: "What is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree

[Figure: tree rooted at Start, branching to S1 and S2 on €, and again to S1/S2 on reading a1 and then a2.]

Start (on €): S1 = 1.0, S2 = 0.0

Reading a1 (arc probabilities 0.1, 0.3, 0.2, 0.3):
    1.0 × 0.1 = 0.1,  0.3,  0.0,  0.0

Reading a2 (arc probabilities 0.2, 0.4, 0.3, 0.2):
    0.1 × 0.2 = 0.02,  0.1 × 0.4 = 0.04,  0.3 × 0.3 = 0.09,  0.3 × 0.2 = 0.06

Choose the winning sequence per state, per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd…

[Figure: continuing the tree from the winners after a2 (S1 = 0.09, S2 = 0.06). Reading a1 gives 0.09 × 0.1 = 0.009, 0.09 × 0.3 = 0.027, 0.06 × 0.2 = 0.012, 0.06 × 0.3 = 0.018. Reading a2 then gives, at S1: 0.027 × 0.3 = 0.0081 and 0.012 × 0.2 = 0.0024; at S2: 0.012 × 0.4 = 0.0048 and 0.027 × 0.2 = 0.0054.]

The problem being addressed by this tree is S* = argmax_S P(S | a1-a2-a1-a2, μ)

where a1-a2-a1-a2 is the output sequence and μ the model (or the machine).

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 → S2 → S1 → S2 → S1, on outputs a1, a2, a1, a2

Problem statement: Find the best possible sequence

S* = argmax_S P(S | O, μ)

where S = state sequence, O = output sequence, μ = machine or model

μ = (S, S0, A, T): state collection, start symbol, alphabet set, transitions

T is defined as P(S_i --a_k--> S_j) ∀ i, j, k

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

Rows = ending state, columns = latest symbol observed:

       €      a1                              a2              a1               a2
S1    1.0    (1.0×0.1, 0.0×0.2) = (0.1, 0.0)  (0.02, 0.09)    (0.009, 0.012)   (0.0024, 0.0081)
S2    0.0    (1.0×0.3, 0.0×0.3) = (0.3, 0.0)  (0.04, 0.06)    (0.027, 0.018)   (0.0048, 0.0054)

Note: every cell records the winning probability ending in that state; the bold-faced value in each cell shows the sequence probability ending in that state. Going backward from the final winner, which ends in state S2 (indicated by the 2nd tuple), we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm
(following James Allen, Natural Language Understanding (2nd edition), Benjamin Cummings (pub.), 1995)

Given:
1. The HMM, which means:
   a. Start state: S1
   b. Alphabet: A = {a1, a2, …, ap}
   c. Set of states: S = {S1, S2, …, Sn}
   d. Transition probability P(S_i --a_k--> S_j) ∀ i, j, k, which is equal to P(S_j, a_k | S_i)
2. The output string: a1 a2 … aT

To find: The most likely sequence of states C1 C2 … CT which produces the given output sequence, i.e.,

C1 C2 … CT = argmax_C [P(C | a1 a2 … aT)]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contd…

Data structures:
1. An N×T array called SEQSCORE to maintain the winner sequence always (N = number of states, T = length of the output sequence)
2. Another N×T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For i = 2 to N do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For t = 2 to T do
    For i = 1 to N do
        SEQSCORE(i,t) = Max over j = 1..N of [SEQSCORE(j, (t-1)) × P(S_j --a_k--> S_i)]
        BACKPTR(i,t) = the index j that gives the MAX above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence Identification

C(T) = the i that maximizes SEQSCORE(i,T)
For i from (T-1) to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1×T
2. SEQSCORE can be T×2

Homework: Compare this with A* / beam search. Reason for this comparison: both of them work for finding and recovering a sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 109
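Putting the three steps together, here is a sketch of the algorithm run on the two-state probabilistic FSM above (arc probabilities as read off the tree); it should recover the winning score 0.0081 and the path S1 S2 S1 S2 S1 found earlier. The nested-dictionary encoding of T is my own, not from the slides.

```python
# Viterbi sketch with SEQSCORE and BACKPTR, run on the two-state FSM used above.
# T[i][a][j] = P(Si --a--> Sj); this encoding is illustrative.

T = {'S1': {'a1': {'S1': 0.1, 'S2': 0.3}, 'a2': {'S1': 0.2, 'S2': 0.4}},
     'S2': {'a1': {'S1': 0.2, 'S2': 0.3}, 'a2': {'S1': 0.3, 'S2': 0.2}}}

def viterbi(output, states, T, start):
    n = len(output)
    SEQSCORE = [{s: 0.0 for s in states} for _ in range(n + 1)]
    BACKPTR  = [{s: None for s in states} for _ in range(n + 1)]
    SEQSCORE[0][start] = 1.0                           # initialization
    for t, a in enumerate(output, start=1):            # iteration
        for i in states:
            best_j, best = None, 0.0
            for j in states:
                score = SEQSCORE[t - 1][j] * T[j][a].get(i, 0.0)
                if score > best:
                    best_j, best = j, score
            SEQSCORE[t][i], BACKPTR[t][i] = best, best_j
    # sequence identification (work backward through BACKPTR)
    last = max(states, key=lambda s: SEQSCORE[n][s])
    path = [last]
    for t in range(n, 0, -1):
        path.append(BACKPTR[t][path[-1]])
    return SEQSCORE[n][last], list(reversed(path))

print(viterbi(['a1', 'a2', 'a1', 'a2'], ['S1', 'S2'], T, 'S1'))
# -> (≈0.0081, ['S1', 'S2', 'S1', 'S2', 'S1'])
```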

Viterbi Algorithm for the Urn problem (first two symbols)

[Figure: trellis for the first two symbols (ε, R). From S0 the urns U1, U2, U3 receive initial scores 0.5, 0.3, 0.2 on ε; after reading R the intermediate scores 0.03, 0.08, 0.15, 0.06, 0.02, 0.02, 0.18, 0.24, 0.18 and the expanded scores 0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036 are recorded, and the winner per urn is retained.]

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of order > 1 (say 2)

The same theory works:

P(S) · P(O|S) = P(O0|S0) · P(S1|S0)
                · [P(O1|S1) · P(S2|S1,S0)] · [P(O2|S2) · P(S3|S2,S1)]
                · [P(O3|S3) · P(S4|S3,S2)] · [P(O4|S4) · P(S5|S4,S3)]
                · [P(O5|S5) · P(S6|S5,S4)] · [P(O6|S6) · P(S7|S6,S5)]
                · [P(O7|S7) · P(S8|S7,S6)] · [P(O8|S8) · P(S9|S8,S7)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e., P(S9|S8,S7) = 1.
O0 is the ε-transition.

Obs:    ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111
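In code terms the only change for order 2 is that the transition lookup conditions on the previous two states rather than one. A tiny hedged sketch of the scoring loop (the table names and the (None, S0) key for the first factor are placeholders of mine):

```python
# Order-2 scoring sketch: the transition table is keyed by the pair (S_{k-1}, S_k).
# 'trans2' and 'emit' are placeholder tables, not data from the lecture.

def score_order2(states_seq, obs_seq, trans2, emit):
    """P(S) * P(O|S) with states_seq = [S0, ..., S_{m+1}] and obs_seq = [o0, ..., om]."""
    p = 1.0
    for k, o in enumerate(obs_seq):
        prev2 = states_seq[k - 1] if k >= 1 else None    # no earlier state before S0
        prev1, nxt = states_seq[k], states_seq[k + 1]
        p *= emit[prev1].get(o, 0.0) * trans2[(prev2, prev1)].get(nxt, 0.0)
    return p
```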

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence? The language modeling problem.

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   • Less probable because of a grammatical mistake
3. P("The svn rises in the east")
   • Less probable because of a lexical mistake
4. P("The sun rises in the west")
   • Less probable because of a semantic mistake

Probabilities are computed in the context of corpora.

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of a language model

1. Detect well-formedness
   • Lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   • Given a piece of text, what language does it belong to?
     Good morning  - English
     Guten Morgen  - German
     Bonjour       - French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0 o1 o2 o3 … om)?

P(O) = Σ_S P(O, S)    [Marginalization]

Consider the observation sequence and the state sequence:

O = o0 o1 o2 … om
S = S0 S1 S2 S3 … Sm Sm+1

where the S_i's represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0 o1 o2 o3 … om)

P(O, S) = P(S) · P(O|S)

P(S0 S1 … Sm Sm+1) · P(o0 o1 o2 … om | S0 S1 … Sm)
   = [P(S0) · P(S1|S0) · P(S2|S1) · … · P(Sm+1|Sm)] · [P(o0|S0) · P(o1|S1) · … · P(om|Sm)]
   = P(S0) · [P(o0|S0) · P(S1|S0)] · [P(o1|S1) · P(S2|S1)] · … · [P(om|Sm) · P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Dynamic Programming^

N V O

3O2V1N OVN5OVN4

OVN OVN

Є

people

laugh

06 x 10 = 06 0

202

06 x 01 x 10-3 = 6 x 10-5

1 06 x 04 x 10-3 = 24 x 10-4

2 06 x 03 x 10-3 = 18 x 10-4

3 06 x 02 x 10-3 = 12 x 10-4

No need to expand N4and N5 because they will never be a part of the winning sequence

Computational Complexity

Retain only those N V O nodes which ends in the highest sequence probability

Now complexity reduces from |s||o| to |s||o|

Here we followed the Markov assumption of order 1

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state Keep advancing sequences that are

ldquomaximumrdquo amongst all those ending in the same state

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm^

N V O

N V O N V O N V O

(06) (02) (02)

(00610^-3)(02410^-3)

(01810^-3)

(00610^-6)

(00210^-6)

(00610^-6)

(0) (0) (0)

Claim We do not need to draw all the subtrees in the algorithm

Tree for the sentence ldquo^ People laugh rdquo

Ԑ

People

21 July 2014Pushpak Bhattacharyya Intro

POS 87

Effect of shifting probability mass Will a word be always given the same tag No Consider the example

^ people the city with soldiers (ie lsquopopulatersquo)

^ quickly people the city In the first sentence ldquopeoplerdquo is most likely

to be tagged as noun whereas in the second probability mass will shift and ldquopeoplerdquo will be tagged as verb since it occurs after an adverb

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence? Language modeling problem

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   • Less probable because of grammatical mistake
3. P("The svn rises in the east")
   • Less probable because of lexical mistake
4. P("The sun rises in the west")
   • Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model
1. Detect well-formedness
   • Lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   • Given a piece of text, what language does it belong to?
     Good morning  - English
     Guten Morgen  - German
     Bonjour       - French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0, o1, o2, o3, ..., om)?

P(O) = Σ_S P(O, S)        [Marginalization]

Consider the observation sequence O = o0 o1 o2 ... om, produced along a state sequence
S0 S1 S2 S3 ... Sm Sm+1, where the Si's represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0, o1, o2, o3, ..., om)

P(O) = Σ_S P(O, S)
     = Σ_S P(O|S) . P(S)

P(S) . P(O|S) = P(S0 S1 S2 ... Sm+1) . P(o0 o1 o2 ... om | S0 S1 ... Sm+1)
              = [P(S0) . P(S1|S0) . P(S2|S1) ... P(Sm+1|Sm)] . [P(o0|S0) . P(o1|S1) ... P(om|Sm)]
              = [P(o0|S0) . P(S1|S0)] [P(o1|S1) . P(S2|S1)] ... [P(om|Sm) . P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116
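A brute-force sketch of this marginalization, enumerating every state sequence; this is exactly the exponential computation that the forward/backward procedures avoid. The tiny HMM used here (states, emission and transition values) is an illustrative assumption, not from the slides:

    from itertools import product

    # Arc-emission convention of the slides: P(p --o--> q) = P(o|p) * P(q|p).
    states = ["S1", "S2"]
    emit  = {("S1", "R"): 0.3, ("S1", "G"): 0.7, ("S2", "R"): 0.6, ("S2", "G"): 0.4}      # assumed
    trans = {("S1", "S1"): 0.4, ("S1", "S2"): 0.6, ("S2", "S1"): 0.5, ("S2", "S2"): 0.5}  # assumed
    start = "S1"

    def prob_O(observations):
        """P(O) = sum over all state sequences S of P(O, S)."""
        total = 0.0
        for path in product(states, repeat=len(observations)):   # S1 ... Sm+1 after the start state
            p, prev = 1.0, start
            for o, q in zip(observations, path):
                p *= emit[(prev, o)] * trans[(prev, q)]            # P(o|prev) * P(q|prev)
                prev = q
            total += p
        return total

    print(prob_O(["R", "G", "R"]))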

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k,i)

Define F(k,i) = probability of being in state Si having seen o0 o1 o2 ... ok,
i.e., F(k,i) = P(o0 o1 o2 ... ok, Si).

With m as the length of the observed sequence and N states,
P(observed sequence) = P(o0 o1 o2 ... om)
                     = Σ_{p=0..N} P(o0 o1 o2 ... om, Sp)
                     = Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd.)

F(k, q) = P(o0 o1 o2 ... ok, Sq)
        = P(o0 o1 o2 ... ok-1, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 ... ok-1, Sp, ok, Sq)
        = Σ_{p=0..N} P(o0 o1 o2 ... ok-1, Sp) . P(ok, Sq | o0 o1 o2 ... ok-1, Sp)
        = Σ_{p=0..N} F(k-1, p) . P(ok, Sq | Sp)
        = Σ_{p=0..N} F(k-1, p) . P(Sp --ok--> Sq)

O0  O1  O2  O3  ...  Ok  Ok+1  ...  Om-1  Om
S0  S1  S2  S3  ...  Sp  Sq    ...  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(k,i)

Define B(k,i) = probability of seeing ok ok+1 ok+2 ... om given that the state was Si,
i.e., B(k,i) = P(ok ok+1 ok+2 ... om | Si).

With m as the length of the whole observed sequence,
P(observed sequence) = P(o0 o1 o2 ... om)
                     = P(o0 o1 o2 ... om | S0)
                     = B(0, 0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd.)

B(k, p) = P(ok ok+1 ok+2 ... om | Sp)
        = P(ok, ok+1 ok+2 ... om | Sp)
        = Σ_{q=0..N} P(ok, Sq, ok+1 ok+2 ... om | Sp)
        = Σ_{q=0..N} P(ok, Sq | Sp) . P(ok+1 ok+2 ... om | ok, Sq, Sp)
        = Σ_{q=0..N} P(ok+1 ok+2 ... om | Sq) . P(ok, Sq | Sp)
        = Σ_{q=0..N} B(k+1, q) . P(Sp --ok--> Sq)

O0  O1  O2  O3  ...  Ok  Ok+1  ...  Om-1  Om
S0  S1  S2  S3  ...  Sp  Sq    ...  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121
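Both recurrences translate directly into code. A minimal sketch under the slides' arc-emission convention P(Sp --ok--> Sq) = P(ok|Sp) . P(Sq|Sp); the helper name arc and the dictionary representation are assumptions for illustration:

    def arc(emit, trans, p, o, q):
        # Arc-emission convention of the slides: P(p --o--> q) = P(o|p) * P(q|p)
        return emit.get((p, o), 0.0) * trans.get((p, q), 0.0)

    def forward(states, start, emit, trans, obs):
        """F[k][q] = P(o0 ... ok, state q after emitting ok); P(O) = sum(F[-1].values())."""
        F = [{q: arc(emit, trans, start, obs[0], q) for q in states}]
        for o in obs[1:]:
            F.append({q: sum(F[-1][p] * arc(emit, trans, p, o, q) for p in states)
                      for q in states})
        return F

    def backward(states, start, emit, trans, obs):
        """B[k][p] = P(ok ... om | state p); P(O) = B[0][start]."""
        B = [{p: 1.0 for p in states}]                     # B(m+1, p) = 1 for all p
        for o in reversed(obs):
            B.insert(0, {p: sum(arc(emit, trans, p, o, q) * B[0][q] for q in states)
                         for p in states})
        return B

A quick consistency check: for any k, Σ_q F[k][q] . B[k+1][q] gives the same value, namely P(O) (cf. the exercise on a later slide).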

How Forward Probability Works

Goal of forward probability: to find P(O), the probability of the observation sequence.

E.g., ^ People laugh

[Trellis figure: start node ^, then states N and V for "People", and N and V for "laugh".]

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probabilities (row: current tag, column: next tag):

        ^     N     V     .
  ^     0    0.7   0.3    0
  N     0    0.2   0.6   0.2
  V     0    0.6   0.2   0.2
  .     1     0     0     0

Lexical probabilities (row: tag, column: word):

        ε    People  Laugh
  ^     1      0       0
  N     0     0.8     0.2
  V     0     0.1     0.9
  .     1      0       0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all state sequences of Π_i P(Oi|Si) . P(Si --> Si+1)

Computation in the various paths of the tree (ε People Laugh):

Path 1: ^ N N    P(Path1) = (1.0 x 0.7) x (0.8 x 0.2) x (0.2 x 0.2)
Path 2: ^ N V    P(Path2) = (1.0 x 0.7) x (0.8 x 0.6) x (0.9 x 0.2)
Path 3: ^ V N    P(Path3) = (1.0 x 0.3) x (0.1 x 0.6) x (0.2 x 0.2)
Path 4: ^ V V    P(Path4) = (1.0 x 0.3) x (0.1 x 0.2) x (0.9 x 0.2)

[Tree figure: root ^ with an ε arc to N and to V for "People"; each of these branches again to N and to V for "Laugh".]

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the Trellis

F = accumulated F x output probability x transition probability

F1 (at N, after ε)        = 0.7 x 1.0
F2 (at V, after ε)        = 0.3 x 1.0
F3 (at N, after "People") = F1 x (0.2 x 0.8) + F2 x (0.6 x 0.1)
F4 (at V, after "People") = F1 x (0.6 x 0.8) + F2 x (0.2 x 0.1)
F5 (at ., after "Laugh")  = F3 x (0.2 x 0.2) + F4 x (0.2 x 0.9)

[Trellis figure: ^ --ε--> {N, V} --People--> {N, V} --Laugh--> .]

21 July 2014Pushpak Bhattacharyya Intro

POS 125
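A short sketch that reproduces these trellis values with the transition and lexical tables given above and checks them against the four tree paths (dictionary names are illustrative assumptions; summing the four path probabilities gives the same P(O)):

    trans = {("^", "N"): 0.7, ("^", "V"): 0.3,
             ("N", "N"): 0.2, ("N", "V"): 0.6, ("N", "."): 0.2,
             ("V", "N"): 0.6, ("V", "V"): 0.2, ("V", "."): 0.2}
    lex = {("^", "ε"): 1.0, ("N", "People"): 0.8, ("N", "Laugh"): 0.2,
           ("V", "People"): 0.1, ("V", "Laugh"): 0.9}

    # Forward values on the trellis (arc-emission: word emitted while leaving the tag).
    F1 = trans[("^", "N")] * lex[("^", "ε")]                       # 0.7
    F2 = trans[("^", "V")] * lex[("^", "ε")]                       # 0.3
    F3 = F1 * trans[("N", "N")] * lex[("N", "People")] + F2 * trans[("V", "N")] * lex[("V", "People")]
    F4 = F1 * trans[("N", "V")] * lex[("N", "People")] + F2 * trans[("V", "V")] * lex[("V", "People")]
    F5 = F3 * trans[("N", ".")] * lex[("N", "Laugh")] + F4 * trans[("V", ".")] * lex[("V", "Laugh")]

    # Brute force over the 4 paths of the tree gives the same P(O).
    paths = [("N", "N"), ("N", "V"), ("V", "N"), ("V", "V")]
    P_O = sum(trans[("^", t1)] * lex[("^", "ε")]
              * trans[(t1, t2)] * lex[(t1, "People")]
              * trans[(t2, ".")] * lex[(t2, "Laugh")]
              for t1, t2 in paths)
    print(F5, P_O)   # both ≈ 0.06676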

Number of Multiplications

Tree: Each path has 5 multiplications + 1 addition.
There are 4 paths in the tree.
Therefore a total of 20 multiplications and 3 additions.

Trellis:
F1 -> 1 multiplication
F2 -> 1 multiplication
F3 = F1 x (1 mult) + F2 x (1 mult) = 4 multiplications + 1 addition
Similarly for F4 and F5: 4 multiplications and 1 addition each.
So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = #states and |O| = observation length (excluding '^' and '.').

Stage 1 of the trellis: |S| multiplications.
Stage 2 of the trellis: |S| nodes; each node needs computation over |S| arcs.
  Each arc = 1 multiplication; the accumulated F = 1 more multiplication.
  Total: 2|S|^2 multiplications.
The same holds for each stage before reading '.'.
At the final stage ('.'): 2|S| multiplications.

Therefore, total multiplications = |S| + 2|S|^2 (|O| - 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values, each multiplied by P(Si --> Sj).
3. Complexity = |S| + 2|S|^2 (|O| - 1) + 2|S|
              = 2|S|^2 |O| - 2|S|^2 + 3|S|
              = O(|S|^2 |O|)

i.e., linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128
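As a quick check against the trellis example: assuming |O| here counts only the two words "People" and "Laugh" (|O| = 2) and |S| = 2 (tags N and V), the formula gives |S| + 2|S|^2 (|O| - 1) + 2|S| = 2 + 8 + 4 = 14 multiplications, matching the count of 14 on the "Number of Multiplications" slide.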

Exercise

1. Backward probability
   a) Derive the backward algorithm
   b) Compute its complexity
2. Express P(O) in terms of both forward and backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List

TnT (http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf)

Brill Tagger (http://delivery.acm.org/10.1145/1080000/1075553/p112-brill.pdf)

Hindi POS Tagger built by IIT Bombay (http://www.cse.iitb.ac.in/~pb/papers/ACL-2006-Hindi-POS-Tagging.pdf)

Projection (http://www.dipanjandas.com/files/posInduction.pdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Computational Complexity

Retain only those N, V, O nodes which end in the highest sequence probability.

The complexity then reduces from |S|^|O| to |S| x |O|.

Here we followed the Markov assumption of order 1.

Points to ponder wrt HMM and Viterbi

21 July 2014Pushpak Bhattacharyya Intro

POS 85

Viterbi Algorithm

Start with the start state.
Keep advancing sequences that are "maximum" amongst all those ending in the same state.

21 July 2014Pushpak Bhattacharyya Intro

POS 86

Viterbi Algorithm

Tree for the sentence "^ People laugh"

[Tree figure: the first-level nodes N, V, O carry probabilities 0.6, 0.2, 0.2; the next level carries 0.06x10^-3, 0.24x10^-3, 0.18x10^-3; the level after that carries 0.06x10^-6, 0.02x10^-6, 0.06x10^-6; the remaining nodes carry 0.]

Claim: We do not need to draw all the subtrees in the algorithm.

21 July 2014Pushpak Bhattacharyya Intro

POS 87
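A runnable Viterbi sketch for such a sentence, using the transition and lexical tables from the "Transition and Lexical Probability Tables" slide earlier in the deck (so the numbers differ from the tree above, which uses a different table); variable names are illustrative assumptions:

    trans = {("^", "N"): 0.7, ("^", "V"): 0.3,
             ("N", "N"): 0.2, ("N", "V"): 0.6, ("N", "."): 0.2,
             ("V", "N"): 0.6, ("V", "V"): 0.2, ("V", "."): 0.2}
    lex = {("^", "ε"): 1.0, ("N", "People"): 0.8, ("N", "Laugh"): 0.2,
           ("V", "People"): 0.1, ("V", "Laugh"): 0.9}

    best = {"^": 1.0}                               # best partial-sequence score per state
    paths = {"^": ["^"]}
    for word, nxt_states in [("ε", ["N", "V"]), ("People", ["N", "V"]), ("Laugh", ["."])]:
        new_best, new_paths = {}, {}
        for q in nxt_states:
            # keep only the best-scoring sequence ending in q (the Viterbi pruning step)
            p, prev = max(((best[s] * lex.get((s, word), 0.0) * trans.get((s, q), 0.0), s)
                           for s in best), key=lambda t: t[0])
            new_best[q], new_paths[q] = p, paths[prev] + [q]
        best, paths = new_best, new_paths

    winner = max(best, key=best.get)
    print(paths[winner], best[winner])   # ['^', 'N', 'V', '.'] with probability ≈ 0.06048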

Effect of shifting probability mass

Will a word always be given the same tag? No. Consider the example:

  ^ people the city with soldiers   (i.e., 'populate')
  ^ quickly people the city

In the first sentence "people" is most likely to be tagged as noun, whereas in the second the probability mass will shift and "people" will be tagged as verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and language phenomenon

Long tail phenomenon: probability is very low, but not zero, over a large observed sequence.

Language phenomenon: "people", which is predominantly tagged as "Noun", displays a long tail phenomenon; "laugh" is predominantly tagged as "Verb".

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

[Figure: two nodes N1 (6x10^-5) and N2 (6x10^-8), each expanding to N, V, O on reading LAUGH.]

Next step: all the probabilities will be multiplied by identical probabilities (lexical and transition). So the children of N2 will have probability less than the children of N1.

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean?

P(A|B) = P(B|A), if P(A) = P(B)

P(A|B) means:
  Causality: B causes A
  Sequentiality: A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here: S = {U1, U2, U3}, V = {R, G, B}
For observation O = o1 ... on and state sequence Q = q1 ... qn,
π_i = P(q1 = Ui) is the initial state probability.

A = transition probability matrix:
        U1    U2    U3
U1     0.1   0.4   0.5
U2     0.6   0.2   0.2
U3     0.3   0.4   0.3

B = observation (emission) probability matrix:
        R     G     B
U1     0.3   0.5   0.2
U2     0.1   0.4   0.5
U3     0.6   0.1   0.3

92

Observations and states

       O1  O2  O3  O4  O5  O6  O7  O8
OBS:   R   R   G   G   B   R   G   R
State: S1  S2  S3  S4  S5  S6  S7  S8

Si = U1, U2, or U3 (a particular state)
S: state sequence
O: observation sequence
S*: "best" possible state (urn) sequence
Goal: maximize P(S|O) by choosing the "best" S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

S* = argmax_S P(S|O)

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

P(S|O) = P(S1 S2 S3 S4 S5 S6 S7 S8 | O1 O2 O3 O4 O5 O6 O7 O8)
       = P(S1|O) . P(S2|S1, O) . P(S3|S2 S1, O) ... P(S8|S7 ... S1, O)

By Markov assumption (a state depends only on the previous state):
P(S|O) = P(S1|O) . P(S2|S1, O) . P(S3|S2, O) ... P(S8|S7, O)

       O1  O2  O3  O4  O5  O6  O7  O8
OBS:   R   R   G   G   B   R   G   R
State: S1  S2  S3  S4  S5  S6  S7  S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) = P(A) . P(B|A) / P(B)

P(A): Prior
P(B|A): Likelihood

argmax_S P(S|O) = argmax_S P(S) . P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transition Probability

P(S) = P(S1-8)
     = P(S1) . P(S2|S1) . P(S3|S2 S1) . P(S4|S3 S2 S1) ... P(S8|S7 ... S1)

By Markov assumption (k=1):
P(S) = P(S1) . P(S2|S1) . P(S3|S2) . P(S4|S3) ... P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence Probability

P(O|S) = P(O1|S1-8) . P(O2|O1, S1-8) . P(O3|O2 O1, S1-8) ... P(O8|O7 ... O1, S1-8)

Assumption: the ball drawn depends only on the urn chosen.

P(O|S) = P(O1|S1) . P(O2|S2) . P(O3|S3) ... P(O8|S8)

P(S|O) = P(S) . P(O|S)
       = P(S1) . P(S2|S1) . P(S3|S2) . P(S4|S3) ... P(S8|S7)
         . P(O1|S1) . P(O2|S2) . P(O3|S3) ... P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S) . P(O|S) = [P(O0|S0) . P(S1|S0)]
                [P(O1|S1) . P(S2|S1)] [P(O2|S2) . P(S3|S2)] [P(O3|S3) . P(S4|S3)]
                [P(O4|S4) . P(S5|S4)] [P(O5|S5) . P(S6|S5)] [P(O6|S6) . P(S7|S6)]
                [P(O7|S7) . P(S8|S7)] [P(O8|S8) . P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8, the next state is S9 with probability 1, i.e., P(S9|S8) = 1.
O0 is an ε-transition.

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:    ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99
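A small sketch that evaluates this grouped product for the urn HMM, using the A and B tables from the "Back to the Urn Example" slide; the initial probabilities P(S1|S0) = 0.5, 0.3, 0.2 for U1, U2, U3 are read off the Viterbi tree figure for the first two symbols, and the state sequence in the example call is an arbitrary illustration:

    # A[i][j] = P(next urn = j | current urn = i); B[i][o] = P(ball o | urn i)
    A = {"U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},
         "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
         "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3}}
    B = {"U1": {"R": 0.3, "G": 0.5, "B": 0.2},
         "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
         "U3": {"R": 0.6, "G": 0.1, "B": 0.3}}
    init = {"U1": 0.5, "U2": 0.3, "U3": 0.2}      # assumed P(S1|S0), read off the tree figure

    def joint(states, obs):
        """P(S).P(O|S) grouped as [P(O0|S0)P(S1|S0)] [P(O1|S1)P(S2|S1)] ...,
        for obs = O1..O8 and states = the urns chosen at steps 1..8."""
        p = 1.0 * init[states[0]]                  # P(ε|S0) = 1, times P(S1|S0)
        for k in range(len(obs) - 1):
            p *= B[states[k]][obs[k]] * A[states[k]][states[k + 1]]
        p *= B[states[-1]][obs[-1]] * 1.0          # P(S9|S8) = 1
        return p

    print(joint(["U1", "U2", "U1", "U3", "U2", "U1", "U3", "U2"],
                ["R", "R", "G", "G", "B", "R", "G", "R"]))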

Introducing useful notation

[Chain figure: S0 --ε--> S1 --R--> S2 --R--> S3 --G--> S4 --G--> S5 --B--> S6 --R--> S7 --G--> S8 --R--> S9]

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:    ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

Notation: P(Ok|Sk) . P(Sk+1|Sk) = P(Sk --Ok--> Sk+1)

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

[FSM figure: two states S1 and S2 (arc assignments as recoverable from the tree computations on the following slides).
 Self-loop on S1: (a1, 0.1), (a2, 0.2).  Arc S1 -> S2: (a1, 0.3), (a2, 0.4).
 Arc S2 -> S1: (a1, 0.2), (a2, 0.3).     Self-loop on S2: (a1, 0.3), (a2, 0.2).]

The question here is: "what is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree

Start (ε):    S1 = 1.0                                  S2 = 0.0

Reading a1:   from S1 (arcs 0.1, 0.3): 1.0*0.1 = 0.1 at S1,  1.0*0.3 = 0.3 at S2
              from S2 (arcs 0.2, 0.3): 0.0 at S1,            0.0 at S2

Reading a2:   from S1 (arcs 0.2, 0.4): 0.1*0.2 = 0.02 at S1, 0.1*0.4 = 0.04 at S2
              from S2 (arcs 0.3, 0.2): 0.3*0.3 = 0.09 at S1, 0.3*0.2 = 0.06 at S2

Choose the winning sequence per state, per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure (contd.)

Reading a1:   from S1 (0.09): 0.09*0.1 = 0.009 at S1,  0.09*0.3 = 0.027 at S2
              from S2 (0.06): 0.06*0.2 = 0.012 at S1,  0.06*0.3 = 0.018 at S2

Reading a2:   the winners 0.012 (at S1) and 0.027 (at S2) expand to
              0.0024 at S1 (x0.2), 0.0048 at S2 (x0.4), 0.0081 at S1 (x0.3), 0.0054 at S2 (x0.2)

The problem being addressed by this tree is
S* = argmax_S P(S | a1-a2-a1-a2, μ),
where a1-a2-a1-a2 is the output sequence and μ the model or the machine.

21 July 2014Pushpak Bhattacharyya Intro

POS 103

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Effect of shifting probability mass

Will a word always be given the same tag? No. Consider the examples:

^ people the city with soldiers (i.e. 'populate')
^ quickly people the city

In the first sentence "people" is most likely to be tagged as a noun, whereas in the second the probability mass shifts and "people" is tagged as a verb, since it occurs after an adverb.

21 July 2014Pushpak Bhattacharyya Intro

POS 88

Tail phenomenon and Language phenomenon

Long-tail phenomenon: the probability is very low, but not zero, over a large observed sequence.

Language phenomenon: "people", which is predominantly tagged as "Noun", displays a long-tail phenomenon (it occasionally occurs as a verb); "laugh" is predominantly tagged as "Verb".

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

[Figure: two tree nodes N1 (probability 6 x 10^-5) and N2 (probability 6 x 10^-8) for the word LAUGH, each expanding into the children N, V and O.]

Next step: all the probabilities will be multiplied by identical factors (lexical and transition). So the children of N2 will have lower probability than the children of N1, and the N2 branch can never overtake N1.

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean?

P(A|B) = P(B|A), if P(A) = P(B)

P(A|B) can mean:
- Causality: B causes A
- Sequentiality: A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = {U1, U2, U3}, V = {R, G, B}.
For observation sequence O = o1 ... on and state sequence Q = q1 ... qn,
the initial distribution is pi, with pi_i = P(q1 = Ui).

A (state transition probabilities):

        U1    U2    U3
U1     0.1   0.4   0.5
U2     0.6   0.2   0.2
U3     0.3   0.4   0.3

B (observation probabilities):

        R     G     B
U1     0.3   0.5   0.2
U2     0.1   0.4   0.5
U3     0.6   0.1   0.3

92
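To make the tables concrete, here is a minimal illustrative sketch (plain Python) that stores the urn HMM and scores one particular state/observation pair. The uniform start distribution is an assumption: the slide defines pi_i = P(q1 = Ui) but does not give its numeric values.

# Urn HMM from the slides, as plain Python dictionaries.
# NOTE: the uniform start distribution "pi" is assumed, not given on the slide.

states  = ["U1", "U2", "U3"]
symbols = ["R", "G", "B"]

pi = {"U1": 1/3, "U2": 1/3, "U3": 1/3}                 # assumed

A = {                                                   # A[u][v] = P(next urn v | current urn u)
    "U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},
    "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
    "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3},
}

B = {                                                   # B[u][c] = P(ball colour c | urn u)
    "U1": {"R": 0.3, "G": 0.5, "B": 0.2},
    "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
    "U3": {"R": 0.6, "G": 0.1, "B": 0.3},
}

def joint_prob(state_seq, obs_seq):
    """P(Q, O) = pi(q1) B(o1|q1) * prod_t A(q_t|q_{t-1}) B(o_t|q_t)."""
    p = pi[state_seq[0]] * B[state_seq[0]][obs_seq[0]]
    for prev, cur, obs in zip(state_seq, state_seq[1:], obs_seq[1:]):
        p *= A[prev][cur] * B[cur][obs]
    return p

print(joint_prob(["U1", "U2", "U3"], ["R", "G", "B"]))  # -> 0.00096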

Observations and states

        O1  O2  O3  O4  O5  O6  O7  O8
Obs:    R   R   G   G   B   R   G   R
State:  S1  S2  S3  S4  S5  S6  S7  S8

Si = U1, U2 or U3 (a particular state)
S:  state sequence
O:  observation sequence
S*: "best" possible state (urn) sequence

Goal: maximize P(S|O) by choosing the "best" S.

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

S* = argmax_S ( P(S|O) )

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

P(S|O) = P(S1-8 | O1-8)
       = P(S1|O) . P(S2|S1, O) . P(S3|S1-2, O) ... P(S8|S1-7, O)

By the Markov assumption (a state depends only on the previous state):

P(S|O) = P(S1|O) . P(S2|S1, O) . P(S3|S2, O) ... P(S8|S7, O)

        O1  O2  O3  O4  O5  O6  O7  O8
Obs:    R   R   G   G   B   R   G   R
State:  S1  S2  S3  S4  S5  S6  S7  S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayes' Theorem

P(A|B) = P(A) . P(B|A) / P(B)

P(A)    - prior
P(B|A)  - likelihood

argmax_S P(S|O) = argmax_S P(S) . P(O|S)

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

P(S) = P(S1-8)
     = P(S1) . P(S2|S1) . P(S3|S1-2) . P(S4|S1-3) ... P(S8|S1-7)

By the Markov assumption (k = 1):

P(S) = P(S1) . P(S2|S1) . P(S3|S2) . P(S4|S3) ... P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

P(O|S) = P(O1|S1-8) . P(O2|O1, S1-8) . P(O3|O1-2, S1-8) ... P(O8|O1-7, S1-8)

Assumption: the ball drawn depends only on the urn chosen, so

P(O|S) = P(O1|S1) . P(O2|S2) . P(O3|S3) ... P(O8|S8)

Putting the two factorizations together:

P(S|O) ∝ P(S) . P(O|S)
       = P(S1) . P(S2|S1) . P(S3|S2) . P(S4|S3) ... P(S8|S7)
         . P(O1|S1) . P(O2|S2) . P(O3|S3) ... P(O8|S8)

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S) . P(O|S) = [P(O0|S0) P(S1|S0)]
                [P(O1|S1) P(S2|S1)] [P(O2|S2) P(S3|S2)] [P(O3|S3) P(S4|S3)]
                [P(O4|S4) P(S5|S4)] [P(O5|S5) P(S6|S5)] [P(O6|S6) P(S7|S6)]
                [P(O7|S7) P(S8|S7)] [P(O8|S8) P(S9|S8)]

We introduce the states S0 and S9 as the initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e. P(S9|S8) = 1.
O0 is the ε-transition.

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:    ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99
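The factorization can be checked directly by brute force. The illustrative sketch below (urn tables repeated so it runs on its own, start distribution again assumed uniform) enumerates all 3^8 urn sequences for the observation R R G G B R G R and picks the one that maximizes P(S) . P(O|S); the Viterbi algorithm developed in the following slides finds the same winner without enumerating every sequence.

from itertools import product

states = ["U1", "U2", "U3"]
pi = {s: 1/3 for s in states}                      # assumed start distribution
A = {"U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},
     "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
     "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3}}
B = {"U1": {"R": 0.3, "G": 0.5, "B": 0.2},
     "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
     "U3": {"R": 0.6, "G": 0.1, "B": 0.3}}

obs = list("RRGGBRGR")

def score(seq):
    """P(S) . P(O|S), grouped per step as on the slide."""
    p = pi[seq[0]] * B[seq[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[seq[t - 1]][seq[t]] * B[seq[t]][obs[t]]
    return p

best = max(product(states, repeat=len(obs)), key=score)
print(best, score(best))        # argmax urn sequence and its probability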

Introducing useful notation

[Figure: the states S0, S1, ..., S9 drawn as a chain, with each arc Sk -> Sk+1 labelled by the observation emitted at that step (ε, R, R, G, G, B, R, G, R).]

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:    ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

Each bracketed factor is abbreviated as a single arc probability:

P(Ok|Sk) . P(Sk+1|Sk) = P(Sk --Ok--> Sk+1)

i.e. the probability of moving from Sk to Sk+1 while emitting Ok.

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

[Figure: a two-state probabilistic FSM over S1 and S2, with arcs labelled by (symbol, probability) pairs: (a1, 0.3), (a2, 0.4), (a1, 0.2), (a2, 0.3), (a1, 0.1), (a2, 0.2), (a1, 0.3), (a2, 0.2).]

The question here is: "what is the most likely state sequence given the output sequence seen?"

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the tree

[Figure: the search tree rooted at Start, branching into S1 and S2 at every level; the levels correspond to reading ε, a1, a2.]

ε:   S1 = 1.0,  S2 = 0.0
a1:  arc probabilities 0.1, 0.3 (out of S1) and 0.2, 0.3 (out of S2)
     1 x 0.1 = 0.1,  0.3,  0.0,  0.0
a2:  arc probabilities 0.2, 0.4 (out of S1) and 0.3, 0.2 (out of S2)
     0.1 x 0.2 = 0.02,  0.1 x 0.4 = 0.04,  0.3 x 0.3 = 0.09,  0.3 x 0.2 = 0.06

Choose the winning sequence per state, per iteration.

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contd...

[Figure: the tree extended for the third and fourth output symbols.]

Third symbol (a1), extending the surviving scores 0.09 (S1) and 0.06 (S2):
    0.09 x 0.1 = 0.009,  0.027,  0.012,  0.018

Fourth symbol (a2), extending the per-state winners:
    S1: 0.3 -> 0.0081,  S2: 0.2 -> 0.0054,  S2: 0.4 -> 0.0048,  S1: 0.2 -> 0.0024

The problem being addressed by this tree is

    S* = argmax_S P(S | a1-a2-a1-a2, µ)

where a1-a2-a1-a2 is the output sequence and µ the model (or the machine).

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 -> S2 -> S1 -> S2 -> S1, emitting a1, a2, a1, a2 (recovered by following the back-pointers from the final winner).

Problem statement: find the best possible sequence

    S* = argmax_S P(S | O, µ)

where S: state sequence, O: output sequence, µ: machine or model.

µ = (S0, S, A, T)
    S0: start symbol (state)
    S:  state collection
    A:  alphabet set
    T:  transitions, defined as P(Si --ak--> Sj) for all i, j, k.

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

Rows: ending state; columns: latest symbol observed.

       ε     a1                               a2              a1                a2
S1    1.0   (1.0x0.1, 0.0x0.2) = (0.1, 0.0)   (0.02, 0.09)    (0.009, 0.012)    (0.0024, 0.0081)
S2    0.0   (1.0x0.3, 0.0x0.3) = (0.3, 0.0)   (0.04, 0.06)    (0.027, 0.018)    (0.0048, 0.0054)

Note: every cell records the winning probability of a sequence ending in that state.

The bold-faced value in each cell is the best sequence probability ending in that state. Going backward from the final winner sequence, which ends in state S2 (indicated by the 2nd element of the tuple), we recover the sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm (following James Allen, Natural Language Understanding, 2nd edition, Benjamin Cummings (pub.), 1995)

Given:
1. The HMM, which means:
   a. Start state S1
   b. Alphabet A = {a1, a2, ..., ap}
   c. Set of states S = {S1, S2, ..., Sn}
   d. Transition probability P(Si --ak--> Sj), which is equal to P(Sj, ak | Si)
2. The output string a1 a2 ... aT

To find: the most likely sequence of states C1 C2 ... CT which produces the given output sequence, i.e.

    C1 C2 ... CT = argmax_C [ P(C | a1 a2 ... aT) ]

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contd...

Data structures:
1. An N x T array called SEQSCORE to maintain the winner sequence always
   (N = number of states, T = length of the output sequence)
2. Another N x T array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation:
1. Initialization
2. Iteration
3. Sequence identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1. Initialization

SEQSCORE(1,1) = 1.0
BACKPTR(1,1)  = 0
For i = 2 to N do
    SEQSCORE(i,1) = 0.0
[expressing the fact that the first state is S1]

2. Iteration

For t = 2 to T do
    For i = 1 to N do
        SEQSCORE(i,t) = Max over j = 1..N of [ SEQSCORE(j, t-1) * P(Sj --ak--> Si) ]
                        (ak = the symbol observed at position t)
        BACKPTR(i,t)  = the index j that gives the Max above

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3. Sequence identification

C(T) = the i that maximizes SEQSCORE(i,T)
For i from (T-1) to 1 do
    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:
1. BACKPTR can be 1 x T
2. SEQSCORE can be T x 2

Homework: compare this with A* and Beam Search.
Reason for this comparison: both of them work for finding and recovering a sequence.

21 July 2014Pushpak Bhattacharyya Intro

POS 109
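For concreteness, here is a minimal Python sketch of the three steps run on the urn example (tables repeated so it is self-contained). It deviates slightly from the slide: instead of pinning the machine to a single start state S1, the initialization folds the assumed uniform start distribution and the first emission into SEQSCORE at t = 0.

states = ["U1", "U2", "U3"]
pi = {s: 1/3 for s in states}                      # assumed start distribution
A = {"U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},      # transition probabilities
     "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
     "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3}}
B = {"U1": {"R": 0.3, "G": 0.5, "B": 0.2},         # emission probabilities
     "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
     "U3": {"R": 0.6, "G": 0.1, "B": 0.3}}

def viterbi(obs):
    T = len(obs)
    SEQSCORE = {s: [0.0] * T for s in states}      # best score of a path ending in s at time t
    BACKPTR  = {s: [None] * T for s in states}     # predecessor state on that best path

    # 1. Initialization
    for s in states:
        SEQSCORE[s][0] = pi[s] * B[s][obs[0]]

    # 2. Iteration
    for t in range(1, T):
        for s in states:
            best_prev = max(states, key=lambda p: SEQSCORE[p][t - 1] * A[p][s])
            SEQSCORE[s][t] = SEQSCORE[best_prev][t - 1] * A[best_prev][s] * B[s][obs[t]]
            BACKPTR[s][t] = best_prev

    # 3. Sequence identification (follow the back-pointers from the best final state)
    last = max(states, key=lambda s: SEQSCORE[s][T - 1])
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(BACKPTR[path[-1]][t])
    return list(reversed(path)), SEQSCORE[last][T - 1]

print(viterbi(list("RRGGBRGR")))                   # best urn sequence and its score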

Viterbi Algorithm for the Urn problem (first two symbols)

[Figure: the Viterbi search tree for the urn example over the first two symbols (ε, then R). From S0 the urns U1, U2, U3 are expanded; the node and arc values shown on the slide are, level by level: 0.5, 0.3, 0.2; 0.03, 0.08, 0.15; 0.06, 0.02, 0.02, 0.18, 0.24, 0.18; and the leaf scores 0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036.]

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of order > 1 (say 2)

The same theory works:

P(S) . P(O|S) = [P(O0|S0) P(S1|S0)]
                [P(O1|S1) P(S2|S1,S0)] [P(O2|S2) P(S3|S2,S1)] [P(O3|S3) P(S4|S3,S2)]
                [P(O4|S4) P(S5|S4,S3)] [P(O5|S5) P(S6|S5,S4)] [P(O6|S6) P(S7|S6,S5)]
                [P(O7|S7) P(S8|S7,S6)] [P(O8|S8) P(S9|S8,S7)]

We introduce the states S0 and S9 as the initial and final states respectively.
After S8 the next state is S9 with probability 1, i.e. P(S9|S8,S7) = 1.
O0 is the ε-transition.

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:    ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111
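The only change for order 2 is that each transition factor conditions on the previous two states. The sketch below shows that scoring loop; everything in it (the states X/Y, the trigram table A2, the emission values) is hypothetical, since the slides define the factorization but give no second-order numbers.

def score_order2(states_seq, obs_seq, first_trans, A2, B):
    """P(S) . P(O|S) with second-order transitions P(S_{k+1} | S_k, S_{k-1}).
    states_seq: [S0, S1, ..., S_{m+1}]; obs_seq: [O0, O1, ..., Om] (O0 = dummy epsilon).
    first_trans[s0][s1]     = P(S1 = s1 | S0 = s0)
    A2[(s_km1, s_k)][s_kp1] = P(S_{k+1} | S_k, S_{k-1})
    B[s][o]                 = P(o | s)
    """
    assert len(states_seq) == len(obs_seq) + 1
    p = B[states_seq[0]][obs_seq[0]] * first_trans[states_seq[0]][states_seq[1]]
    for k in range(1, len(obs_seq)):
        p *= B[states_seq[k]][obs_seq[k]]
        p *= A2[(states_seq[k - 1], states_seq[k])][states_seq[k + 1]]
    return p

# Tiny hypothetical example with two states X, Y and symbols a, b:
first_trans = {"^": {"X": 0.6, "Y": 0.4}}
A2 = {("^", "X"): {"X": 0.5, "Y": 0.5}, ("^", "Y"): {"X": 0.3, "Y": 0.7},
      ("X", "X"): {"X": 0.2, "Y": 0.8}, ("X", "Y"): {"X": 0.6, "Y": 0.4},
      ("Y", "X"): {"X": 0.9, "Y": 0.1}, ("Y", "Y"): {"X": 0.5, "Y": 0.5}}
B = {"^": {"eps": 1.0}, "X": {"a": 0.7, "b": 0.3}, "Y": {"a": 0.2, "b": 0.8}}

print(score_order2(["^", "X", "Y", "X"], ["eps", "a", "b"], first_trans, A2, B))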

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence? The language modeling problem.

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   - less probable, because of the grammatical mistake
3. P("The svn rises in the east")
   - less probable, because of the lexical mistake
4. P("The sun rises in the west")
   - less probable, because of the semantic mistake

Probabilities are computed in the context of corpora.

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of a language model:

1. Detect well-formedness
   - lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   - Given a piece of text, what language does it belong to?
     Good morning  - English
     Guten Morgen  - German
     Bonjour       - French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0, o1, o2, o3, ..., om)?

P(O) = Σ_S P(O, S)        [marginalization]

Consider the observation sequence

    O = O0 O1 O2 ... Om
    S = S0 S1 S2 S3 ... Sm Sm+1

where the Si represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0, o1, o2, o3, ..., om)

P(O, S) = P(S) . P(O|S)

P(S0 S1 S2 ... Sm+1) . P(O0 O1 O2 ... Om | S0 S1 ... Sm+1)
    = [P(S0) P(S1|S0) P(S2|S1) ... P(Sm+1|Sm)] . [P(O0|S0) P(O1|S1) ... P(Om|Sm)]
    = P(S0) [P(O0|S0) P(S1|S0)] [P(O1|S1) P(S2|S1)] ... [P(Om|Sm) P(Sm+1|Sm)]

P(O) is then obtained by summing this product over all state sequences S.

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k, i)

Define F(k, i) = probability of being in state Si having seen o0 o1 o2 ... ok, i.e.

    F(k, i) = P(o0, o1, o2, ..., ok, Si)

With m as the length of the observed sequence and N states,

    P(observed sequence) = P(o0, o1, o2, ..., om)
                         = Σ_{p=0..N} P(o0, o1, o2, ..., om, Sp)
                         = Σ_{p=0..N} F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)

F(k, q) = P(o0, o1, ..., ok, Sq)
        = P(o0, o1, ..., ok-1, ok, Sq)
        = Σ_{p=0..N} P(o0, o1, ..., ok-1, Sp, ok, Sq)
        = Σ_{p=0..N} P(o0, o1, ..., ok-1, Sp) . P(ok, Sq | o0, o1, ..., ok-1, Sp)
        = Σ_{p=0..N} F(k-1, p) . P(ok, Sq | Sp)
        = Σ_{p=0..N} F(k-1, p) . P(Sp --ok--> Sq)

        O0  O1  O2  O3  ...  Ok  Ok+1  ...  Om-1  Om
        S0  S1  S2  S3  ...  Sp  Sq    ...  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119
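A direct transcription of this recursion on the urn example, as an illustrative sketch (emission attached to the destination state, tables repeated, start distribution assumed uniform as before):

states = ["U1", "U2", "U3"]
pi = {s: 1/3 for s in states}                      # assumed start distribution
A = {"U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},
     "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
     "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3}}
B = {"U1": {"R": 0.3, "G": 0.5, "B": 0.2},
     "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
     "U3": {"R": 0.6, "G": 0.1, "B": 0.3}}

def forward(obs):
    """Returns P(O) = sum_p F(m, p), where F[q] = P(o_0 ... o_k, state at step k = q)."""
    F = {q: pi[q] * B[q][obs[0]] for q in states}                          # k = 0
    for o in obs[1:]:                                                      # k = 1 .. m
        F = {q: sum(F[p] * A[p][q] for p in states) * B[q][o] for q in states}
    return sum(F.values())

print(forward(list("RRGGBRGR")))                   # P(R R G G B R G R)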

Backward probability B(k, i)

Define B(k, i) = probability of seeing ok, ok+1, ok+2, ..., om given that the state was Si, i.e.

    B(k, i) = P(ok, ok+1, ok+2, ..., om | Si)

With m as the length of the whole observed sequence,

    P(observed sequence) = P(o0, o1, o2, ..., om)
                         = P(o0, o1, o2, ..., om | S0)
                         = B(0, 0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)

B(k, p) = P(ok, ok+1, ok+2, ..., om | Sp)
        = P(ok+1, ok+2, ..., om, ok | Sp)
        = Σ_{q=0..N} P(ok+1, ok+2, ..., om, ok, Sq | Sp)
        = Σ_{q=0..N} P(ok, Sq | Sp) . P(ok+1, ok+2, ..., om | ok, Sq, Sp)
        = Σ_{q=0..N} P(ok, Sq | Sp) . P(ok+1, ok+2, ..., om | Sq)
        = Σ_{q=0..N} B(k+1, q) . P(Sp --ok--> Sq)

        O0  O1  O2  O3  ...  Ok  Ok+1  ...  Om-1  Om
        S0  S1  S2  S3  ...  Sp  Sq    ...  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121
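The mirror-image recursion can be sketched the same way (illustrative; same urn tables and assumed start distribution). Note one indexing choice: the slide's B(k, i) includes o_k, while the beta below leaves the current symbol out, which is the more common convention; combining beta at k = 0 with the start distribution and the first emission recovers the same P(O) as the forward pass.

states = ["U1", "U2", "U3"]
pi = {s: 1/3 for s in states}                      # assumed start distribution
A = {"U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},
     "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
     "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3}}
B = {"U1": {"R": 0.3, "G": 0.5, "B": 0.2},
     "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
     "U3": {"R": 0.6, "G": 0.1, "B": 0.3}}

def backward(obs):
    """beta[p] = P(o_{t+1} ... o_m | state at step t = p), built right to left."""
    beta = {p: 1.0 for p in states}                                        # t = m
    for o in reversed(obs[1:]):
        beta = {p: sum(A[p][q] * B[q][o] * beta[q] for q in states) for p in states}
    return beta                                                            # t = 0

obs = list("RRGGBRGR")
beta0 = backward(obs)
p_obs = sum(pi[q] * B[q][obs[0]] * beta0[q] for q in states)
print(p_obs)                                       # equals the forward result for P(O)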

How Forward Probability Works

Goal of forward probability: to find P(O), the probability of the observation sequence.

E.g. ^ people laugh

[Figure: the trellis with the start node ^ followed by two columns of nodes {N, V}, one column per word.]

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Transition and Lexical Probability Tables

Transition probabilities (rows: from, columns: to; the last row/column is the end marker $):

        ^     N     V     $
^       0    0.7   0.3    0
N       0    0.2   0.6   0.2
V       0    0.6   0.2   0.2
$       1     0     0     0

Lexical (emission) probabilities:

        ε    People  Laugh
^       1      0       0
N       0     0.8     0.2
V       0     0.1     0.9
$       1      0       0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

P(O) = Σ over all paths [ Π over arcs P(Oi|Si) . P(Si -> Sj) ]

Computation in the various paths of the tree:

            ε    People   Laugh
Path 1:     ^      N        N
            P(Path1) = (1.0 x 0.7) x (0.8 x 0.2) x (0.2 x 0.2)

            ε    People   Laugh
Path 2:     ^      N        V
            P(Path2) = (1.0 x 0.7) x (0.8 x 0.6) x (0.9 x 0.2)

            ε    People   Laugh
Path 3:     ^      V        N
            P(Path3) = (1.0 x 0.3) x (0.1 x 0.6) x (0.2 x 0.2)

            ε    People   Laugh
Path 4:     ^      V        V
            P(Path4) = (1.0 x 0.3) x (0.1 x 0.2) x (0.9 x 0.2)

[Figure: the four ^ -> {N, V} -> {N, V} paths drawn as a tree over the inputs ε, People, Laugh.]

21 July 2014Pushpak Bhattacharyya Intro

POS 124
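The four path products can be reproduced, and generalized to longer inputs, with a few lines of Python. This is the "inefficient computation": the number of paths grows exponentially with the sentence length. The tables are the ones from the previous slide, with "$" standing for the end-marker row/column (an interpretation of the unlabelled fourth entry).

from itertools import product

trans = {"^": {"N": 0.7, "V": 0.3},
         "N": {"N": 0.2, "V": 0.6, "$": 0.2},
         "V": {"N": 0.6, "V": 0.2, "$": 0.2}}
lex   = {"^": {"eps": 1.0},
         "N": {"People": 0.8, "Laugh": 0.2},
         "V": {"People": 0.1, "Laugh": 0.9}}

words = ["eps", "People", "Laugh"]                 # ε People Laugh

def path_prob(tags):
    """Product over arcs of P(word | tag) * P(next tag | tag), as on the slide."""
    seq = ["^"] + list(tags) + ["$"]
    p = 1.0
    for i, w in enumerate(words):
        p *= lex[seq[i]][w] * trans[seq[i]][seq[i + 1]]
    return p

total = 0.0
for tags in product(["N", "V"], repeat=len(words) - 1):
    p = path_prob(tags)
    total += p
    print("^ " + " ".join(tags), p)
print("P(O) =", total)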

Computations on the Trellis

F = accumulated F x output probability x transition probability

F_N1 = 0.7 x 1.0
F_V1 = 0.3 x 1.0
F_N2 = F_N1 x (0.2 x 0.8) + F_V1 x (0.6 x 0.1)
F_V2 = F_N1 x (0.6 x 0.8) + F_V1 x (0.2 x 0.1)
F_$  = F_N2 x (0.2 x 0.2) + F_V2 x (0.2 x 0.9)

[Figure: the trellis ^ -> {N, V} -> {N, V} -> $ over the inputs ε, People, Laugh, with the accumulated F value attached to each node.]

21 July 2014Pushpak Bhattacharyya Intro

POS 125
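Written out in Python (illustrative sketch, same tables as above), the trellis accumulates F stage by stage and ends with the same P(O) as the four-path enumeration, using fewer multiplications:

trans = {"^": {"N": 0.7, "V": 0.3},
         "N": {"N": 0.2, "V": 0.6, "$": 0.2},
         "V": {"N": 0.6, "V": 0.2, "$": 0.2}}
lex   = {"^": {"eps": 1.0},
         "N": {"People": 0.8, "Laugh": 0.2},
         "V": {"People": 0.1, "Laugh": 0.9}}

words = ["eps", "People", "Laugh"]
tags  = ["N", "V"]

# F = accumulated F x output probability x transition probability
F = {"^": 1.0}
for i, w in enumerate(words):
    targets = tags if i < len(words) - 1 else ["$"]    # last arc goes to the end marker
    nxt = {}
    for src, f in F.items():
        for dst in targets:
            nxt[dst] = nxt.get(dst, 0.0) + f * lex[src][w] * trans[src][dst]
    F = nxt
    print(w, F)                                        # per-stage accumulated F values

print("P(O) =", F["$"])                                # matches the path-by-path sum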

Number of Multiplications

Tree: each path has 5 multiplications + 1 addition.
There are 4 paths in the tree.
Therefore a total of 20 multiplications and 3 additions.

Trellis:
F_N1 -> 1 multiplication
F_V1 -> 1 multiplication
F_N2 = F_N1 x (1 mult) + F_V1 x (1 mult) = 4 multiplications + 1 addition
Similarly for F_V2 and F_$: 4 multiplications and 1 addition each.

So a total of 14 multiplications and 3 additions.

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = number of states and |O| = observation length (excluding '^' and '$').

Stage 1 of the trellis: |S| multiplications.
Stage 2 of the trellis: |S| nodes; each node needs computation over |S| arcs.
    Each arc      = 1 multiplication
    Accumulated F = 1 more multiplication
    Total: 2|S|^2 multiplications.
Same for each stage before reading '$'. At the final stage ('$'): 2|S| multiplications.

Therefore total multiplications = |S| + 2|S|^2 (|O| - 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values multiplied by P(S_i -> S_j).
3. Complexity = |S| + 2|S|^2 (|O| - 1) + 2|S|
              = 2|S|^2 |O| - 2|S|^2 + 3|S|
              = O(|S|^2 |O|)

i.e. linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward probability
   a) Derive the Backward Algorithm
   b) Compute its complexity
2. Express P(O) in terms of both the Forward and the Backward probability.

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding):

- Scrabble: auto-completion of words (human vs. machine)
- Humour detection using WordNet (incongruity theory)
- Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Tail phenomenon and Language phenomenon Long tail Phenomenon Probability is very low but not zero

over a large observed sequence

Language Phenomenon ldquopeoplerdquo which is predominantly tagged as ldquoNounrdquo displays

a long tail phenomenon ldquolaughrdquo is predominantly tagged as ldquoVerbrdquo

21 July 2014Pushpak Bhattacharyya Intro

POS 89

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Viterbi phenomenon (Markov process)

N1 N2

N V O N V O

(610^-5) (610^-8)

LAUGH

Next step all the probabilities will be multiplied by identical probability (lexical and transition) So children of N2 will have probability less than the children of N1

21 July 2014Pushpak Bhattacharyya Intro

POS 90

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

What does P(A|B) mean

P(A|B)= P(B|A)If P(A)=P(B)

P(A|B) means Causality B causes A Sequentiality A follows B

21 July 2014Pushpak Bhattacharyya Intro

POS 91

Back to the Urn Example

Here S = U1 U2 U3 V = RGB

For observation O =o1hellip on

And State sequence Q =q1hellip qn

π is

U1 U2 U3

U1 01 04 05

U2 06 02 02

U3 03 04 03

R G B

U1 03 05 02

U2 01 04 05

U3 06 01 03

A =

B=

)( 1 ii UqP

92

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V


Transition and Lexical Probability Tables

Transition probabilities P(column | row):
        ^     N     V     .
  ^     0    0.7   0.3    0
  N     0    0.2   0.6   0.2
  V     0    0.6   0.2   0.2
  .     1     0     0     0

Lexical probabilities P(word | tag):
        ε    People   Laugh
  ^     1      0        0
  N     0     0.8      0.2
  V     0     0.1      0.9
  .     1      0        0

Inefficient Computation


P(O) = Σ over all state sequences of Π P(Oj | Si) P(Si → Sj), i.e. the sum, over every path through the tree, of the product of (lexical probability × transition probability) along its arcs.

Computation in various paths of the Tree

Observations along every path: ε, People, Laugh (emitted by the first, second and third state).

Path 1: ^ N N     P(Path1) = (1.0 × 0.7) × (0.8 × 0.2) × (0.2 × 0.2)
Path 2: ^ N V     P(Path2) = (1.0 × 0.7) × (0.8 × 0.6) × (0.9 × 0.2)
Path 3: ^ V N     P(Path3) = (1.0 × 0.3) × (0.1 × 0.6) × (0.2 × 0.2)
Path 4: ^ V V     P(Path4) = (1.0 × 0.3) × (0.1 × 0.2) × (0.9 × 0.2)

(Tree figure: from ^ on ε to N or V for "People", then from each to N or V for "Laugh".)
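The same four products can be checked mechanically; the snippet below encodes the transition and lexical tables above as dictionaries (with "e" standing in for the ε symbol) and sums the paths.

# Enumerate the four tree paths explicitly and sum them.
trans = {"^": {"N": 0.7, "V": 0.3},
         "N": {"N": 0.2, "V": 0.6, ".": 0.2},
         "V": {"N": 0.6, "V": 0.2, ".": 0.2},
         ".": {"^": 1.0}}
emit = {"^": {"e": 1.0},
        "N": {"People": 0.8, "Laugh": 0.2},
        "V": {"People": 0.1, "Laugh": 0.9},
        ".": {"e": 1.0}}
obs = ["e", "People", "Laugh"]            # "e" stands in for ε

def path_prob(path):
    p = 1.0
    for k, o in enumerate(obs):
        p *= emit[path[k]].get(o, 0.0) * trans[path[k]].get(path[k + 1], 0.0)
    return p

paths = [("^", a, b, ".") for a in "NV" for b in "NV"]
for path in paths:
    print(path, path_prob(path))            # 0.00448, 0.06048, 0.00072, 0.00108
print("P(O) =", sum(path_prob(p) for p in paths))    # 0.06676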


Computations on the Trellis

F = accumulated F × output probability × transition probability

F1 (N node, "People" column) = 0.7 × 1.0
F2 (V node, "People" column) = 0.3 × 1.0
F3 (N node, "Laugh" column)  = F1 × (0.2 × 0.8) + F2 × (0.6 × 0.1)
F4 (V node, "Laugh" column)  = F1 × (0.6 × 0.8) + F2 × (0.2 × 0.1)
F5 (final '.' node)          = F3 × (0.2 × 0.2) + F4 × (0.2 × 0.9)

(Trellis figure: ^ on ε, then an N node and a V node for "People" and for "Laugh", ending at the full stop.)
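Running the forward() sketch given earlier on the trans/emit dictionaries above reproduces these values (assuming both snippets are in scope):

F = forward(obs, states=["N", "V", "."], start="^", trans=trans, emit=emit)
# F[0] -> {"N": 0.7, "V": 0.3, ".": 0.0}        (F1, F2)
# F[1]["N"] -> 0.13, F[1]["V"] -> 0.342         (F3, F4)
# F[2]["."] -> 0.06676                          (F5, equal to the sum of the 4 tree paths)
print(F[2]["."])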


Number of Multiplications

Tree: each path has 5 multiplications, and there are 4 paths in the tree; summing the path probabilities takes 3 additions. Therefore a total of 20 multiplications and 3 additions.

Trellis: F1 → 1 multiplication; F2 → 1 multiplication; F3 = F1 × (…) + F2 × (…) → each term costs 2 multiplications, so 4 multiplications + 1 addition. Similarly for F4 and F5: 4 multiplications and 1 addition each.

So a total of 14 multiplications and 3 additions.


Complexity

Let |S| = the number of states and |O| = the observation length, excluding the boundary symbols '^' and '.'.

Stage 1 of the trellis: |S| multiplications.
Stage 2 of the trellis: |S| nodes, and each node needs computation over |S| arcs; each arc = 1 multiplication, and multiplying in the accumulated F = 1 more. Total: 2|S|² multiplications.
The same holds for each stage before reading '.'. At the final stage ('.'): 2|S| multiplications.
Therefore total multiplications = |S| + 2|S|²(|O| - 1) + 2|S|.


Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values multiplied by P(Sp → Sq).
3. Complexity = |S| + 2|S|²(|O| - 1) + 2|S|
              = 2|S|²|O| - 2|S|² + 3|S|
              = O(|S|² |O|)
   i.e. linear in the length of the input and quadratic in the number of states.
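A quick numeric check of the formula against the hand count two slides back (|S| = 2 for {N, V}, |O| = 2 for "People", "Laugh"):

S, O = 2, 2
print(S + 2 * S**2 * (O - 1) + 2 * S)    # 14 multiplications, as counted above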


Exercise

1. Backward probability
   a) Derive the backward algorithm.
   b) Compute its complexity.
2. Express P(O) in terms of both the forward and the backward probability.


Possible project topics (will keep adding)

• Scrabble: auto-completion of words (human vs machine)
• Humour detection using WordNet (incongruity theory)
• Multistage POS tagging


Reading List

• TnT (http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf)
• Brill Tagger (http://delivery.acm.org/10.1145/1080000/1075553/p112-brill.pdf)
• Hindi POS Tagger built by IIT Bombay (http://www.cse.iitb.ac.in/~pb/papers/ACL-2006-Hindi-POS-Tagging.pdf)
• Projection (http://www.dipanjandas.com/files/posInduction.pdf)



Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Observations and statesO1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

Si = U1U2U3 A particular stateS State sequenceO Observation sequenceS = ldquobestrdquo possible state (urn) sequenceGoal Maximize P(S|O) by choosing ldquobestrdquo S

21 July 2014Pushpak Bhattacharyya Intro

POS 93

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Goal

Maximize P(S|O) where S is the State Sequence and O is the Observation Sequence

))|((maxarg OSPS S

21 July 2014Pushpak Bhattacharyya Intro

POS 94

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

False Start

)|()|()|()|()|()|()|(

718213121

8181

OSSPOSSPOSSPOSPOSPOSPOSP

By Markov Assumption (a state depends only on the previous state)

)|()|()|()|()|( 7823121 OSSPOSSPOSSPOSPOSP

O1 O2 O3 O4 O5 O6 O7 O8

OBS R R G G B R G RState S1 S2 S3 S4 S5 S6 S7 S8

21 July 2014Pushpak Bhattacharyya Intro

POS 95

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Bayersquos Theorem

)()|()()|( BPABPAPBAP

P(A) - PriorP(B|A) - Likelihood

)|()(maxarg)|(maxarg SOPSPOSP SS

21 July 2014Pushpak Bhattacharyya Intro

POS 96

State Transitions Probability

)|()|()|()|()()()()(

718314213121

81

SSPSSPSSPSSPSPSPSPSP

By Markov Assumption (k=1)

)|()|()|()|()()( 783423121 SSPSSPSSPSSPSPSP

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109
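The three steps above fit in a few lines of code. The sketch below (Python; it reuses the TRANS table written out earlier, and the array names follow SEQSCORE/BACKPTR from the slides while everything else is illustrative) runs on the output sequence a1-a2-a1-a2 and recovers the winning sequence S1-S2-S1-S2-S1 with probability 0.0081, exactly as in the tree and table above:

def viterbi(states, start, obs, trans):
    """trans[(i, a, j)] = P(state i --a--> state j); 'start' is the initial state."""
    T = len(obs)
    seqscore = [{s: 0.0 for s in states} for _ in range(T + 1)]   # SEQSCORE
    backptr = [{s: None for s in states} for _ in range(T + 1)]   # BACKPTR
    seqscore[0][start] = 1.0                                      # 1. initialization
    for t, a in enumerate(obs, start=1):                          # 2. iteration
        for i in states:
            best_j, best = None, 0.0
            for j in states:
                score = seqscore[t - 1][j] * trans.get((j, a, i), 0.0)
                if score > best:
                    best_j, best = j, score
            seqscore[t][i], backptr[t][i] = best, best_j
    last = max(states, key=lambda s: seqscore[T][s])              # 3. sequence identification
    path = [last]
    for t in range(T, 0, -1):
        path.append(backptr[t][path[-1]])
    return list(reversed(path)), seqscore[T][last]

path, prob = viterbi(["S1", "S2"], "S1", ["a1", "a2", "a1", "a2"], TRANS)
print(path, prob)   # ['S1', 'S2', 'S1', 'S2', 'S1'], approx. 0.0081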

Viterbi Algorithm for the Urn problem (first two symbols)

[Tree figure: from the start state S0, ε-arcs lead to U1, U2, U3 with probabilities 0.5, 0.3, 0.2; on observing R, each urn branches again to U1, U2, U3. The intermediate scores shown are 0.03, 0.08, 0.15, 0.06, 0.02, 0.02, 0.18, 0.24, 0.18, and the leaf scores shown are 0.015, 0.04, 0.075, 0.018, 0.006, 0.006, 0.048, 0.036.]

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of order > 1 (say 2)

Same theory works:

P(S) · P(O|S) = P(O0|S0) · P(S1|S0)
                · [P(O1|S1) · P(S2|S1,S0)] · [P(O2|S2) · P(S3|S2,S1)] · [P(O3|S3) · P(S4|S3,S2)]
                · [P(O4|S4) · P(S5|S4,S3)] · [P(O5|S5) · P(S6|S5,S4)] · [P(O6|S6) · P(S7|S6,S5)]
                · [P(O7|S7) · P(S8|S7,S6)] · [P(O8|S8) · P(S9|S8,S7)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8, the next state is S9 with probability 1, i.e. P(S9|S8,S7) = 1.
O0 is the ε-transition.

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:     ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111
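One way to see why the same theory works is that a second-order chain over states can be rewritten as a first-order chain over pairs of states, after which the usual first-order machinery applies unchanged. A hedged sketch of that reduction (Python; the function and dictionary names are illustrative, not from the slides):

def to_first_order(trans2):
    """trans2[(s_prev, s_cur, s_next)] = P(s_next | s_prev, s_cur).
    Returns a first-order table over pair-states:
    trans1[((s_prev, s_cur), (s_cur, s_next))] = the same probability."""
    trans1 = {}
    for (s_prev, s_cur, s_next), p in trans2.items():
        trans1[((s_prev, s_cur), (s_cur, s_next))] = p
    return trans1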

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence? The language modeling problem

1. P("The sun rises in the east")
2. P("The sun rise in the east")
   • Less probable because of a grammatical mistake
3. P("The svn rises in the east")
   • Less probable because of a lexical mistake
4. P("The sun rises in the west")
   • Less probable because of a semantic mistake

Probabilities are computed in the context of corpora.

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of a language model

1. Detect well-formedness
   • Lexical, syntactic, semantic, pragmatic, discourse
2. Language identification
   • Given a piece of text, what language does it belong to?
     Good morning  - English
     Guten morgen  - German
     Bon jour      - French
3. Automatic speech recognition
4. Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0, o1, o2, o3, …, om)?

P(O) = Σ_S P(O, S)        (marginalization)

Consider the observation sequence:

O0  O1  O2  …  Om
S0  S1  S2  S3  …  Sm  Sm+1

where the Si's represent the state sequences.

21 July 2014Pushpak Bhattacharyya Intro

POS 115
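The marginalization can be carried out literally, by enumerating every state sequence. The sketch below (Python, illustrative names, using the same arc-probability convention P(Si --o--> Sj) as above) spells out the computation that the forward and backward recurrences will later make efficient:

from itertools import product

def p_obs_brute_force(obs, states, start, trans):
    """P(O) = sum over all state sequences S of P(O, S), where P(O, S)
    is a product of arc probabilities trans[(i, o, j)]."""
    total = 0.0
    for seq in product(states, repeat=len(obs)):   # every possible state sequence
        prob, prev = 1.0, start
        for o, s in zip(obs, seq):
            prob *= trans.get((prev, o, s), 0.0)
            prev = s
        total += prob
    return total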

Computing P(o0, o1, o2, o3, …, om)

P(O, S) = P(S) · P(O|S)

P(S0, S1, …, Sm+1) · P(O0, O1, …, Om | S0, S1, …, Sm+1)
    = P(S0) · P(S1|S0) · P(S2|S1) … P(Sm+1|Sm) · P(O0|S0) · P(O1|S1) … P(Om|Sm)
    = P(S0) · [P(O0|S0) · P(S1|S0)] · [P(O1|S1) · P(S2|S1)] … [P(Om|Sm) · P(Sm+1|Sm)]

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(k, i)

Define F(k, i) = probability of being in state Si having seen o0 o1 o2 … ok
i.e. F(k, i) = P(o0, o1, o2, …, ok, Si)

With m as the length of the observed sequence and N states:
P(observed sequence) = P(o0, o1, o2, …, om)
                     = Σ(p=0..N) P(o0, o1, o2, …, om, Sp)
                     = Σ(p=0..N) F(m, p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)

F(k, q) = P(o0, o1, o2, …, ok, Sq)
        = P(o0, o1, o2, …, ok-1, ok, Sq)
        = Σ(p=0..N) P(o0, o1, o2, …, ok-1, Sp, ok, Sq)
        = Σ(p=0..N) P(o0, o1, o2, …, ok-1, Sp) · P(ok, Sq | o0, o1, o2, …, ok-1, Sp)
        = Σ(p=0..N) F(k-1, p) · P(ok, Sq | Sp)
        = Σ(p=0..N) F(k-1, p) · P(Sp --ok--> Sq)

O0  O1  O2  O3  …  Ok  Ok+1  …  Om-1  Om
S0  S1  S2  S3  …  Sp  Sq    …  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119
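The recurrence F(k, q) = Σp F(k-1, p) · P(Sp --ok--> Sq) translates directly into code. A minimal sketch (Python, illustrative names, same arc-probability convention as before):

def forward(obs, states, start, trans):
    """F[k][q] = P(o0, ..., ok, state q at step k); P(O) = sum over q of F[m][q]."""
    F = [{q: trans.get((start, obs[0], q), 0.0) for q in states}]      # F(0, q)
    for k in range(1, len(obs)):
        F.append({q: sum(F[k - 1][p] * trans.get((p, obs[k], q), 0.0)
                         for p in states)
                  for q in states})
    return F, sum(F[-1].values())   # forward tables and P(observed sequence)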

Backward probability B(k, i)

Define B(k, i) = probability of seeing ok ok+1 ok+2 … om given that the state was Si
i.e. B(k, i) = P(ok, ok+1, ok+2, …, om | Si)

With m as the length of the whole observed sequence:
P(observed sequence) = P(o0, o1, o2, …, om)
                     = P(o0, o1, o2, …, om | S0)
                     = B(0, 0)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)

B(k, p) = P(ok, ok+1, ok+2, …, om | Sp)
        = P(ok+1, ok+2, …, om, ok | Sp)
        = Σ(q=0..N) P(ok+1, ok+2, …, om, ok, Sq | Sp)
        = Σ(q=0..N) P(ok, Sq | Sp) · P(ok+1, ok+2, …, om | ok, Sq, Sp)
        = Σ(q=0..N) P(ok+1, ok+2, …, om | Sq) · P(ok, Sq | Sp)
        = Σ(q=0..N) B(k+1, q) · P(Sp --ok--> Sq)

O0  O1  O2  O3  …  Ok  Ok+1  …  Om-1  Om
S0  S1  S2  S3  …  Sp  Sq    …  Sm    Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121
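Symmetrically, B(k, p) = Σq B(k+1, q) · P(Sp --ok--> Sq), with B(m, p) reducing to the probability of emitting the last symbol from Sp. A sketch under the same conventions (Python, illustrative names); B(0, start) is then P(O), matching B(0, 0) on the previous slide:

def backward(obs, states, start, trans):
    """tables[k][p] = B(k, p) = P(ok, ..., om | state p at step k)."""
    m = len(obs) - 1
    # Base case: B(m, p) = sum over successors q of P(p --om--> q)
    tables = [{p: sum(trans.get((p, obs[m], q), 0.0) for q in states) for p in states}]
    for k in range(m - 1, -1, -1):
        tables.insert(0, {p: sum(trans.get((p, obs[k], q), 0.0) * tables[0][q]
                                 for q in states)
                          for p in states})
    return tables, tables[0][start]   # backward tables and P(observed sequence)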

How Forward Probability Works

Goal of forward probability: to find P(O) [the probability of the observation sequence]

E.g.: ^ People laugh

[Trellis figure: the start node ^, then N and V nodes for "People" and N and V nodes for "laugh".]

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

Transition probabilities (row = current tag, column = next tag; the fourth tag, written here as '$', is the sentence-end marker):

        ^     N     V     $
  ^     0    0.7   0.3    0
  N     0    0.2   0.6   0.2
  V     0    0.6   0.2   0.2
  $     1     0     0     0

Lexical (emission) probabilities:

        ε    People   Laugh
  ^     1      0        0
  N     0     0.8      0.2
  V     0     0.1      0.9
  $     1      0        0

21 July 2014Pushpak Bhattacharyya Intro

POS 123
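For the worked example the two tables can be written down directly. A sketch in Python (the nested-dictionary layout is illustrative, and '$' is used here as a label for the sentence-end tag, whose symbol is an assumption):

# Transition probabilities P(next_tag | tag)
P_TRANS = {
    "^": {"^": 0.0, "N": 0.7, "V": 0.3, "$": 0.0},
    "N": {"^": 0.0, "N": 0.2, "V": 0.6, "$": 0.2},
    "V": {"^": 0.0, "N": 0.6, "V": 0.2, "$": 0.2},
    "$": {"^": 1.0, "N": 0.0, "V": 0.0, "$": 0.0},
}

# Lexical (emission) probabilities P(word | tag)
P_LEX = {
    "^": {"ε": 1.0, "people": 0.0, "laugh": 0.0},
    "N": {"ε": 0.0, "people": 0.8, "laugh": 0.2},
    "V": {"ε": 0.0, "people": 0.1, "laugh": 0.9},
    "$": {"ε": 1.0, "people": 0.0, "laugh": 0.0},
}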

Inefficient computation:

P(O) = Σ over all paths of Π [ P(Ok | Si) · P(Si → Sj) ]

Computation in various paths of the tree (columns: ε, People, laugh):

Path 1:  ^  N  N
P(Path1) = (1.0 × 0.7) × (0.8 × 0.2) × (0.2 × 0.2)

Path 2:  ^  N  V
P(Path2) = (1.0 × 0.7) × (0.8 × 0.6) × (0.9 × 0.2)

Path 3:  ^  V  N
P(Path3) = (1.0 × 0.3) × (0.1 × 0.6) × (0.2 × 0.2)

Path 4:  ^  V  V
P(Path4) = (1.0 × 0.3) × (0.1 × 0.2) × (0.9 × 0.2)

[Tree figure: from ^ (emitting ε), the tree branches to N and V on "People", and each of these branches again to N and V on "laugh".]

21 July 2014Pushpak Bhattacharyya Intro

POS 124
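Evaluating the four products gives P(Path1) = 0.00448, P(Path2) = 0.06048, P(Path3) = 0.00072 and P(Path4) = 0.00108; summing over all four paths, P(O) = 0.06676.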

Computations on the Trellis

F at a node = accumulated F × output probability × transition probability

F(N, People) = 1.0 × (0.7 × 1.0)
F(V, People) = 1.0 × (0.3 × 1.0)
F(N, laugh)  = F(N, People) × (0.2 × 0.8) + F(V, People) × (0.6 × 0.1)
F(V, laugh)  = F(N, People) × (0.6 × 0.8) + F(V, People) × (0.2 × 0.1)
F(end)       = F(N, laugh) × (0.2 × 0.2) + F(V, laugh) × (0.2 × 0.9)

[Trellis figure: ^ (emitting ε), then N and V nodes for "People", N and V nodes for "laugh", and the final node.]

21 July 2014Pushpak Bhattacharyya Intro

POS 125
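The same number falls out of the trellis. The sketch below (Python, reusing the P_TRANS and P_LEX dictionaries written out earlier, with '$' again standing in for the end tag; names are illustrative) accumulates F column by column in the grouping used on this slide and returns P(O) ≈ 0.06676, the same value as the path-by-path sum:

def forward_trellis(words, tags, p_trans, p_lex):
    """F at a node = sum over predecessors of
    accumulated F x output probability (at the predecessor) x transition probability."""
    cols = ["^"] + words + ["$"]           # '^' emits ε; '$' closes the sentence
    emitted = ["ε"] + words                # symbol emitted when leaving each column
    F = {"^": 1.0}                         # accumulated probability at the start node
    for k in range(1, len(cols)):
        targets = tags if k < len(cols) - 1 else ["$"]
        F = {t: sum(F[s] * p_lex[s][emitted[k - 1]] * p_trans[s][t] for s in F)
             for t in targets}
    return F["$"]                          # = P(O)

print(forward_trellis(["people", "laugh"], ["N", "V"], P_TRANS, P_LEX))  # approx. 0.06676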

Number of Multiplications

Tree:
• Each path has 5 multiplications + 1 addition
• There are 4 paths in the tree
• Therefore a total of 20 multiplications and 3 additions

Trellis:
• F(N, People) → 1 multiplication; F(V, People) → 1 multiplication
• F(N, laugh) = F(N, People) × (…) + F(V, People) × (…): 4 multiplications + 1 addition
• Similarly for F(V, laugh) and F(end): 4 multiplications and 1 addition each
• So a total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = number of states, and |O| = observation length (not counting '^' and the sentence-end marker).

• Stage 1 of the trellis: |S| multiplications
• Stage 2 of the trellis: |S| nodes, each needing computation over |S| arcs; each arc = 1 multiplication, accumulated F = 1 more multiplication. Total: 2|S|² multiplications
• Same for each stage before reading the end marker
• At the final stage (the end marker): 2|S| multiplications
• Therefore total multiplications = |S| + 2|S|²(|O| - 1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127
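As a quick check, plugging the People/laugh example into this formula with |S| = 2 and |O| = 2 gives |S| + 2|S|²(|O| - 1) + 2|S| = 2 + 8 + 4 = 14, which matches the 14 multiplications counted on the Number of Multiplications slide.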

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis
2. Take the sum of F values multiplied by P(Si → Sj)
3. Complexity = |S| + 2|S|²(|O| - 1) + 2|S|
             = 2|S|²|O| - 2|S|² + 3|S|
             = O(|S|² · |O|)

i.e., linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward probability
   a) Derive the backward algorithm
   b) Compute its complexity
2. Express P(O) in terms of both forward and backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List

• TnT (http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf)
• Brill Tagger (http://delivery.acm.org/10.1145/1080000/1075553/p112-brill.pdf)
• Hindi POS Tagger built by IIT Bombay (http://www.cse.iitb.ac.in/~pb/papers/ACL-2006-Hindi-POS-Tagging.pdf)
• Projection (http://www.dipanjandas.com/files/posInduction.pdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

State Transitions Probability

P(S) = P(S1-8)
     = P(S1) · P(S2|S1) · P(S3|S1-2) · P(S4|S1-3) … P(S8|S1-7)

By Markov Assumption (k = 1):

P(S) = P(S1) · P(S2|S1) · P(S3|S2) · P(S4|S3) … P(S8|S7)

21 July 2014Pushpak Bhattacharyya Intro

POS 97

Observation Sequence probability

P(O1-8 | S1-8) = P(O1 | S1-8) · P(O2 | O1, S1-8) · P(O3 | O1-2, S1-8) … P(O8 | O1-7, S1-8)

Assumption: the ball drawn depends only on the urn chosen

P(O1-8 | S1-8) = P(O1|S1) · P(O2|S2) · P(O3|S3) … P(O8|S8)

P(S|O) ∝ P(S) · P(O|S)
       = [P(S1) · P(S2|S1) · P(S3|S2) · P(S4|S3) … P(S8|S7)]
         · [P(O1|S1) · P(O2|S2) · P(O3|S3) … P(O8|S8)]

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S) · P(O|S) = [P(O0|S0) · P(S1|S0)]
                · [P(O1|S1) · P(S2|S1)] · [P(O2|S2) · P(S3|S2)] · [P(O3|S3) · P(S4|S3)]
                · [P(O4|S4) · P(S5|S4)] · [P(O5|S5) · P(S6|S5)] · [P(O6|S6) · P(S7|S6)]
                · [P(O7|S7) · P(S8|S7)] · [P(O8|S8) · P(S9|S8)]

We introduce the states S0 and S9 as initial and final states respectively.
After S8, the next state is S9 with probability 1, i.e. P(S9|S8) = 1.
O0 is the ε-transition.

        O0  O1  O2  O3  O4  O5  O6  O7  O8
Obs:     ε   R   R   G   G   B   R   G   R
State:  S0  S1  S2  S3  S4  S5  S6  S7  S8  S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Observation Sequence probability

)|()|()|()|()|( 81718812138112811 SOOPSOOPSOOPSOPSOP

Assumption that ball drawn depends only on the Urn chosen

)|()|()|()|()|( 88332211 SOPSOPSOPSOPSOP

)|()|()|()|()|()|()|()|()()|(

)|()()|(

88332211

783423121

SOPSOPSOPSOPSSPSSPSSPSSPSPOSP

SOPSPOSP

21 July 2014Pushpak Bhattacharyya Intro

POS 98

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Grouping terms

P(S)P(O|S)= [P(O0|S0)P(S1|S0)]

[P(O1|S1) P(S2|S1)][P(O2|S2) P(S3|S2)] [P(O3|S3)P(S4|S3)] [P(O4|S4)P(S5|S4)] [P(O5|S5)P(S6|S5)] [P(O6|S6)P(S7|S6)] [P(O7|S7)P(S8|S7)][P(O8|S8)P(S9|S8)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014Pushpak Bhattacharyya Intro

POS 99

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Introducing useful notation

S0 S1

S8

S7

S9

S2S3

S4 S5 S6

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

ε RRG G B R

G

R

P(Ok|Sk)P(Sk+1|Sk)=P(SkSk+1)Ok

21 July 2014Pushpak Bhattacharyya Intro

POS 100

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Probabilistic FSM

(a103)

(a204)

(a102)

(a203)

(a101)

(a202)

(a103)

(a202)

The question here isldquowhat is the most likely state sequence given the output sequenceseenrdquo

S1 S2

21 July 2014Pushpak Bhattacharyya Intro

POS 101

Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd.)

F(k, q) = P(o0, o1, o2, …, ok, Sq)
        = P(o0, o1, o2, …, ok-1, ok, Sq)
        = Σ(p=0..N) P(o0, o1, o2, …, ok-1, Sp, ok, Sq)
        = Σ(p=0..N) P(o0, o1, o2, …, ok-1, Sp) · P(ok, Sq | o0, o1, …, ok-1, Sp)
        = Σ(p=0..N) F(k-1, p) · P(ok, Sq | Sp)        [Markov assumption]
        = Σ(p=0..N) F(k-1, p) · P(Sp → Sq on ok)

O0 O1 O2 O3 … Ok Ok+1 … Om-1 Om
S0 S1 S2 S3 … Sp Sq   … Sm   Sfinal
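A compact sketch of the recursion just derived. Here arc[p][q][o] stands for the combined probability P(Sp → Sq on o); that table layout, and keeping F as one dictionary per observation position, are assumptions for illustration.

# Forward probabilities F(k, q) = P(o0..ok, Sq), filled left to right.
def forward(obs, states, start, arc):
    F = [dict.fromkeys(states, 0.0) for _ in obs]
    for q in states:                                   # first symbol: leave the start state
        F[0][q] = arc[start][q].get(obs[0], 0.0)
    for k in range(1, len(obs)):
        for q in states:
            F[k][q] = sum(F[k - 1][p] * arc[p][q].get(obs[k], 0.0) for p in states)
    return F                                           # P(O) = sum(F[-1].values())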


Backward probability B(k,i)

Define B(k,i) = probability of seeing ok, ok+1, ok+2, …, om given that the state was Si
B(k,i) = P(ok, ok+1, ok+2, …, om | Si)

With m as the length of the whole observed sequence:
P(observed sequence) = P(o0, o1, o2, …, om)
                     = P(o0, o1, o2, …, om | S0)
                     = B(0, 0)


Backward probability (contd.)

B(k, p) = P(ok, ok+1, ok+2, …, om | Sp)
        = P(ok+1, ok+2, …, om, ok | Sp)
        = Σ(q=0..N) P(ok+1, ok+2, …, om, ok, Sq | Sp)
        = Σ(q=0..N) P(ok, Sq | Sp) · P(ok+1, ok+2, …, om | ok, Sq, Sp)
        = Σ(q=0..N) P(ok+1, ok+2, …, om | Sq) · P(ok, Sq | Sp)
        = Σ(q=0..N) B(k+1, q) · P(Sp → Sq on ok)

O0 O1 O2 O3 … Ok Ok+1 … Om-1 Om
S0 S1 S2 S3 … Sp Sq   … Sm   Sfinal
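The same recursion written right to left, as a sketch with the same assumed arc table; B beyond the last symbol is seeded with 1 so that B(k, p) can sum over successors. (The full backward algorithm and its complexity are left as the exercise at the end of this section.)

# Backward probabilities B(k, p) = P(ok..om | Sp), filled right to left.
def backward(obs, states, arc):
    m = len(obs)
    B = [dict.fromkeys(states, 0.0) for _ in range(m + 1)]
    B[m] = dict.fromkeys(states, 1.0)                 # nothing left to observe
    for k in range(m - 1, -1, -1):
        for p in states:
            B[k][p] = sum(arc[p][q].get(obs[k], 0.0) * B[k + 1][q] for q in states)
    return B                                          # B[0][start] = P(o0..om | start)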


How Forward Probability Works

Goal of forward probability: to find P(O), the probability of the observation sequence.

E.g. ^ People laugh

[Trellis diagram: start node ^, then nodes N and V for "People", then nodes N and V for "laugh".]


Transition and Lexical Probability Tables

Transition probabilities (current state → next state):

      ^     N     V     .
^     0     0.7   0.3   0
N     0     0.2   0.6   0.2
V     0     0.6   0.2   0.2
.     1     0     0     0

Lexical (emission) probabilities:

      ε     People   Laugh
^     1     0        0
N     0     0.8      0.2
V     0     0.1      0.9
.     1     0        0

Inefficient Computation


P(O) = Σ over all state sequences of Π_i [ P(Oi | Si) · P(Si → Si+1) ]

Computation along the various paths of the tree:

Path 1: ^ N N   (observations: ε People Laugh)
P(Path1) = (1.0 × 0.7) × (0.8 × 0.2) × (0.2 × 0.2)

Path 2: ^ N V   (observations: ε People Laugh)
P(Path2) = (1.0 × 0.7) × (0.8 × 0.6) × (0.9 × 0.2)

Path 3: ^ V N   (observations: ε People Laugh)
P(Path3) = (1.0 × 0.3) × (0.1 × 0.6) × (0.2 × 0.2)

Path 4: ^ V V   (observations: ε People Laugh)
P(Path4) = (1.0 × 0.3) × (0.1 × 0.2) × (0.9 × 0.2)

[Tree: ^ (on ε) branches to N and V for "People"; each of these branches to N and V for "Laugh".]


Computations on the Trellis

F = accumulated F × output probability × transition probability

F(N, People) = 1.0 × 0.7
F(V, People) = 1.0 × 0.3
F(N, laugh)  = F(N, People) × (0.8 × 0.2) + F(V, People) × (0.1 × 0.6)
F(V, laugh)  = F(N, People) × (0.8 × 0.6) + F(V, People) × (0.1 × 0.2)
F(.)         = F(N, laugh) × (0.2 × 0.2) + F(V, laugh) × (0.9 × 0.2)

[Trellis: ^ (ε) → {N, V} on "People" → {N, V} on "laugh" → .]
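A quick numeric check of the trellis values against the four-path enumeration on the previous slide, using the transition and lexical tables given above; treating '.' as the end-of-sentence marker is an assumption about the label lost in extraction. Both computations should give the same P(O), about 0.06676.

# Transition and lexical tables from the slides ('.' as end marker is an assumption).
trans = {'^': {'N': 0.7, 'V': 0.3},
         'N': {'N': 0.2, 'V': 0.6, '.': 0.2},
         'V': {'N': 0.6, 'V': 0.2, '.': 0.2}}
emit = {'^': {'ε': 1.0},
        'N': {'People': 0.8, 'Laugh': 0.2},
        'V': {'People': 0.1, 'Laugh': 0.9}}

# Brute force over the four paths ^ -> x -> y -> .
paths = [('N', 'N'), ('N', 'V'), ('V', 'N'), ('V', 'V')]
p_paths = sum(emit['^']['ε'] * trans['^'][x] *
              emit[x]['People'] * trans[x][y] *
              emit[y]['Laugh'] * trans[y]['.']
              for x, y in paths)

# Trellis, mirroring the F equations above
F_people = {s: emit['^']['ε'] * trans['^'][s] for s in ('N', 'V')}      # 0.7, 0.3
F_laugh = {s: sum(F_people[p] * emit[p]['People'] * trans[p][s] for p in ('N', 'V'))
           for s in ('N', 'V')}
F_end = sum(F_laugh[p] * emit[p]['Laugh'] * trans[p]['.'] for p in ('N', 'V'))

print(p_paths, F_end)   # both are ~0.06676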


Number of Multiplications

Tree: each path has 5 multiplications; there are 4 paths in the tree, and summing the 4 path probabilities takes 3 additions. Therefore a total of 20 multiplications and 3 additions.

Trellis:
F(N, People) → 1 multiplication
F(V, People) → 1 multiplication
F(N, laugh) = F(N, People) × (1 mult) + F(V, People) × (1 mult) = 4 multiplications + 1 addition
Similarly for F(V, laugh) and F(.): 4 multiplications and 1 addition each.
So a total of 14 multiplications and 3 additions.


Complexity

Let |S| = the number of states and |O| = the observation length (the words, excluding '^' and '.').

Stage 1 of the trellis: |S| multiplications.
Stage 2 of the trellis: |S| nodes, each needing computation over |S| arcs; each arc = 1 multiplication, and the accumulated F = 1 more multiplication. Total: 2|S|² multiplications.
The same holds for each stage before reading '.'.
At the final stage ('.'): 2|S| multiplications.

Therefore, total multiplications = |S| + 2|S|²(|O| - 1) + 2|S|.


Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values multiplied by P(Sp → Sq).
3. Complexity = |S| + 2|S|²(|O| - 1) + 2|S|
             = 2|S|²|O| - 2|S|² + 3|S|
             = O(|S|²|O|),
   i.e. linear in the length of the input and quadratic in the number of states.


Exercise

1. Backward probability
   a) Derive the Backward Algorithm
   b) Compute its complexity
2. Express P(O) in terms of both the forward and the backward probability


Possible project topics (will keep adding)

Scrabble: auto-completion of words (human vs. machine)

Humour detection using WordNet (incongruity theory)

Multistage POS tagging


Reading List

TnT (http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf)

Brill Tagger (http://delivery.acm.org/10.1145/1080000/1075553/p112-brill.pdf)

Hindi POS Tagger built by IIT Bombay (http://www.cse.iitb.ac.in/pb/papers/ACL-2006-Hindi-POS-Tagging.pdf)

Projection (http://www.dipanjandas.com/files/posInduction.pdf)


Developing the treeStart

S1 S2

S1 S2 S1 S2

S1 S2 S1 S2

10 00

01 03 02 03

101=01 03 00 00

0102=002 0104=004 0303=009 0302=006

euro

a1

a2

Choose the winning sequence per stateper iteration

02 04 03 02

21 July 2014Pushpak Bhattacharyya Intro

POS 102

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Tree structure contdhellip

S1 S2

S1 S2 S1 S2

01 03 02 03

0027 0012

009 006

00901=0009 0018

S1

03

00081

S2

02

00054

S2

04

00048

S1

02

00024

a1

a2

The problem being addressed by this tree is )|(maxarg 2121 aaaaSPSs

a1-a2-a1-a2 is the output sequence and micro the model or the machine

21 July 2014Pushpak Bhattacharyya Intro

POS 103

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Path found (working backward)

S1 S2 S1 S2 S1

a2a1a1 a2

Problem statement Find the best possible sequence )|(maxarg OSPS

s

Machineor Model SeqOutput Seq State OSwhere

Machineor Model 0 TASS

Start symbol State collection Alphabet set

Transitions

T is defined as kjijk

i SaSP )(

21 July 2014Pushpak Bhattacharyya Intro

POS 104

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Tabular representation of the tree

euro a1 a2 a1 a2

S110 (10010002

)=(0100)(002 009)

(0009 0012) (00024 00081)

S200 (10030003

)=(0300)(00400

6)(00270018) (000480005

4)

Ending state

Latest symbol observed

Note Every cell records the winning probability ending in that state

Final winner

The bold faced values in each cell shows the sequence probability ending in that state Going backward from final winner sequence which ends in state S2 (indicated By the 2nd tuple) we recover the sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 105

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Algorithm(following James Alan Natural Language Understanding (2nd edition) Benjamin Cummins (pub) 1995

Given 1 The HMM which means

a Start State S1

b Alphabet A = a1 a2 hellip apc Set of States S = S1 S2 hellip Snd Transition probability

which is equal to 2 The output string a1a2hellipaT

To find The most likely sequence of states C1C2hellipCT which produces the given output sequence ie C1C2hellipCT =

kjijk

i SaSP )(

)|( ikj SaSP

]|([maxarg 21 TC

aaaCP

21 July 2014Pushpak Bhattacharyya Intro

POS 106

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Algorithm contdhellip

Data Structure1 A NT array called SEQSCORE to maintain the

winner sequence always (N=states T=length of op sequence)

2 Another NT array called BACKPTR to recover the path

Three distinct steps in the Viterbi implementation1 Initialization2 Iteration3 Sequence Identification

21 July 2014Pushpak Bhattacharyya Intro

POS 107

1 Initialization

SEQSCORE(11)=10BACKPTR(11)=0For(i=2 to N) do

SEQSCORE(i1)=00[expressing the fact that first state is S1]

2 Iteration

For(t=2 to T) doFor(i=1 to N) do

SEQSCORE(it) = Max(j=1N)

BACKPTR(It) = index j that gives the MAX above

)]())1(([ SiaSjPtjSEQSCORE k

21 July 2014Pushpak Bhattacharyya Intro

POS 108

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

Complexity

Let |S| = number of states, and |O| = observation length (the number of words, not counting the '^' and '.' markers).

Stage 1 of the trellis: |S| multiplications.
Stage 2 of the trellis: |S| nodes, and each node needs computation over |S| incoming arcs.
Each arc = 1 multiplication; multiplying by the accumulated F = 1 more multiplication.
Total: 2|S|^2 multiplications.
The same holds for each stage before reading '.'.
At the final stage ('.'): 2|S| multiplications.

Therefore, total multiplications = |S| + 2|S|^2 (|O| - 1) + 2|S|.

21 July 2014Pushpak Bhattacharyya Intro

POS 127
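For the running "^ people laugh ." example, |S| = 2 (the states N and V) and |O| = 2, so the formula reproduces the count from the previous slide:

|S| + 2|S|^2 (|O| - 1) + 2|S| = 2 + 2(2^2)(2 - 1) + 2(2) = 2 + 8 + 4 = 14 multiplications.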

Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.
2. Take the sum of the F values, each multiplied by P(Si → Sj).
3. Complexity = |S| + 2|S|^2 (|O| - 1) + 2|S|
             = 2|S|^2 |O| - 2|S|^2 + 3|S|
             = O(|S|^2 |O|),
   i.e., linear in the length of the input and quadratic in the number of states.

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1. Backward probability:
   a) Derive the Backward Algorithm.
   b) Compute its complexity.
2. Express P(O) in terms of both the forward and the backward probability.

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble: auto-completion of words (human vs. machine)
Humour detection using WordNet (incongruity theory)
Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List

TnT (httpwwwaclweborganthology-newAA00A00-1031pdf)
Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)
Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)
Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131


Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

3 Seq Identification

C(T) = i that maximizes SEQSCORE(iT)For i from (T-1) to 1 do

C(i) = BACKPTR[C(i+1)(i+1)]

Optimizations possible1 BACKPTR can be 1T2 SEQSCORE can be T2

Homework- Compare this with A Beam Search [Homework]Reason for this comparison Both of them work for finding and recovering sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 109

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Viterbi Algorithm for the Urn problem (first two symbols)

S0

U1 U2 U3

0503

02

U1 U2 U3

003

008

015

U1 U2 U3 U1 U2 U3

006

002

002018

024

018

0015 004 0075 0018 0006 0006 0048 0036

ε

R

21 July 2014 Pushpak Bhattacharyya Intro POS

110

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Markov process of ordergt1 (say 2)

Same theory worksP(S)P(O|S)= P(O0|S0)P(S1|S0)

[P(O1|S1) P(S2|S1S0)][P(O2|S2) P(S3|S2S1)] [P(O3|S3)P(S4|S3S2)] [P(O4|S4)P(S5|S4S3)] [P(O5|S5)P(S6|S5S4)] [P(O6|S6)P(S7|S6S5)] [P(O7|S7)P(S8|S7S6)][P(O8|S8)P(S9|S8S7)]

We introduce the statesS0 and S9 as initial and final states respectively

After S8 the next state is S9 with probability 1 ie P(S9|S8S7)=1

O0 is ε-transition

O0 O1 O2 O3 O4 O5 O6 O7 O8

Obs ε R R G G B R G RState S0 S1 S2 S3 S4 S5 S6 S7 S8 S9

21 July 2014

Pushpak Bhattacharyya Intro POS 111

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Probability of observation sequence

21 July 2014Pushpak Bhattacharyya Intro

POS 112

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Why probability of observation sequence Language modeling problem

1P(ldquoThe sun rises in the eastrdquo)2P(ldquoThe sun rise in the eastrdquo)

bull Less probable because of grammatical mistake

3P(The svn rises in the east)bull Less probable because of lexical mistake

4P(The sun rises in the west)bull Less probable because of semantic mistake

Probabilities computed in the context of corpora

21 July 2014Pushpak Bhattacharyya Intro

POS 113

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Uses of language model1Detect well-formedness

bull Lexical syntactic semantic pragmatic discourse

2Language identificationbull Given a piece of text what language does it

belong toGood morning - EnglishGuten morgen - GermanBon jour - French

3Automatic speech recognition4Machine translation

21 July 2014Pushpak Bhattacharyya Intro

POS 114

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

How to compute P(o0o1o2o3hellipom)

ationMarginaliz )()( S

SOPOP

Consider the observation sequence

13210

210

mm SSSSSSOmOOO

Where Si s represent the state sequences

21 July 2014Pushpak Bhattacharyya Intro

POS 115

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List TnT (httpwwwaclweborganthology-newAA00A00-

1031pdf)

Brill Tagger (httpdeliveryacmorg10114510800001075553p112-brillpdfip=182191671ampacc=OPENampCFID=129797466ampCFTOKEN=72601926amp__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (httpwwwcseiitbacinpbpapersACL-2006-Hindi-POS-Taggingpdf)

Projection (httpwwwdipanjandascomfilesposInductionpdf)

21 July 2014Pushpak Bhattacharyya Intro

POS 131

Computing P(o0o1o2o3hellipom)

)]|()|()][|()|()[()|()|()|(

)|()|()|()()|()(

)|()()(

101000

1100

112010

2101210

mmmm

mm

mm

mm

SSPSOPSSPSOPSPSOPSOPSOP

SSPSSPSSPSPSOOOOPSSSSP

SOPSPSOP

21 July 2014Pushpak Bhattacharyya Intro

POS 116

Forward and Backward Probability Calculation

21 July 2014Pushpak Bhattacharyya Intro

POS 117

Forward probability F(ki)

Define F(ki)= Probability of being in state Si having seen o0o1o2hellipok

F(ki)=P(o0o1o2hellipok Si ) With m as the length of the observed

sequence There are N states P(observed sequence)=P(o0o1o2om)

=Σp=0N P(o0o1o2om Sp)=Σp=0N F(m p)

21 July 2014Pushpak Bhattacharyya Intro

POS 118

Forward probability (contd)F(k q)= P(o0o1o2ok Sq)= P(o0o1o2ok Sq)= P(o0o1o2ok-1 ok Sq)= Σp=0N P(o0o1o2ok-1 Sp ok Sq)= Σp=0N P(o0o1o2ok-1 Sp )

P(ok Sq|o0o1o2ok-1 Sp)= Σp=0N F(k-1p) P(ok Sq|Sp)

= Σp=0N F(k-1p) P(Sp Sq)ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 119

Backward probability B(ki)

Define B(ki)= Probability of seeing okok+1ok+2hellipom given that the state was Si

B(ki)=P(okok+1ok+2hellipom Si ) With m as the length of the whole

observed sequence P(observed sequence)=P(o0o1o2om)

= P(o0o1o2om| S0)=B(00)

21 July 2014Pushpak Bhattacharyya Intro

POS 120

Backward probability (contd)B(k p)= P(okok+1ok+2hellipom Sp)= P(ok+1ok+2hellipom ok |Sp)= Σq=0N P(ok+1ok+2hellipom ok Sq|Sp)= Σq=0N P(ok Sq|Sp)

P(ok+1ok+2hellipom|ok Sq Sp )= Σq=0N P(ok+1ok+2hellipom|Sq ) P(ok

Sq|Sp)= Σq=0N B(k+1q) P(Sp Sq)

ok

O0 O1 O2 O3 hellip Ok Ok+1 hellip Om-1 Om

S0 S1 S2 S3 hellip Sp Sq hellip Sm Sfinal

21 July 2014Pushpak Bhattacharyya Intro

POS 121

How Forward Probability Works

Goal of Forward Probability To find P(O) [the probability of Observation Sequence]

Eg ^ People laugh

^

N N

V V

21 July 2014Pushpak Bhattacharyya Intro

POS 122

Translation and Lexical Probability Tables

^ N V

^ 0 07 03 0

N 0 02 06 02

V 0 06 02 02

1 0 0 0

ε People Laugh

^ 1 0 0

N 0 08 02

V 0 01 09

1 0 0

Inefficient Computation

21 July 2014Pushpak Bhattacharyya Intro

POS 123

)()()( j

o

iiSS

SSPSOPOPj

Computation in various paths of the Tree ε People

LaughPath 1 ^ N N

P(Path1) = (10x07)x(08x02)x(02x02)

ε People LaughPath 2 ^ N V

P(Path2) = (10x07)x(08x06)x(09x02)

ε People LaughPath 3 ^ V N

P(Path3) = (10x03)x(01x06)x(02x02)

ε People LaughPath 4 ^ V V

P(Path4) = (10x03)x(01x02)x(09x02)

^

V

N

V

N

V

N

ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 124

Computations on the TrellisF = accumulated F x output probability x

transition probability

퐹 = 07x10퐹 = 03x10퐹 = 퐹 x (02x03) + 퐹 x (06x01)퐹 = 퐹 x (06x08) + 퐹 x (02x01)퐹 = 퐹 x (02x02) + 퐹 x (02x09)^

N N

V V

퐹ε

ε

People Laugh

21 July 2014Pushpak Bhattacharyya Intro

POS 125

Number of MultiplicationsTree Each path has 5

multiplications + 1 addition

There are 4 paths in the tree

Therefore total of 20 multiplications and 3 additions

Trellis 퐹 -gt 1 multiplication 퐹 -gt 1 multiplication 퐹 = 퐹 x (1 mult) + 퐹 x

(1 mult)= 4 multiplications + 1

addition Similarly for 퐹 and 퐹 4

multiplications and 1 addition each

So total of 14 multiplications and 3 additions

21 July 2014Pushpak Bhattacharyya Intro

POS 126

ComplexityLet |S| = StatesAnd |O| = Observation length - |^ | Stage 1 of Trellis |S| multiplications Stage 2 of Trellis |S| nodes each node needs

computation over |S| arcs Each Arc = 1 multiplication Accumulated F = 1 more multiplication Total 2|푆| multiplications

Same for each stage before reading lsquo rsquo At final stage (lsquo lsquo) -gt 2|S| multiplications Therefore total multiplications = |S| + 2|푆| (|O| -

1) + 2|S|

21 July 2014Pushpak Bhattacharyya Intro

POS 127

Summary Forward Algorithm

1 Accumulate F over each stage of trellis2 Take sum of F values multiplied by

푃(푆 rarr푆 )3 Complexity = |S| + 2|푆| (|O| - 1) + 2|S|

= 2|푆| |O| - 2|푆| + 3|S|= O(|푆| |O|)

ie linear in the length of input and quadratic in number of states

21 July 2014Pushpak Bhattacharyya Intro

POS 128

Exercise

1 Backward Probabilitya) Derive Backward Algorithmb) Compute its complexity

2 Express P(O) in terms of both Forward and Backward probability

21 July 2014Pushpak Bhattacharyya Intro

POS 129

Possible project topics (will keep adding)

Scrabble auto-completion of words (human vs mc)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging

21 July 2014Pushpak Bhattacharyya Intro

POS 130

Reading List

TnT (http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf)

Brill Tagger (http://delivery.acm.org/10.1145/1080000/1075553/p112-brill.pdf)

Hindi POS Tagger built by IIT Bombay (http://www.cse.iitb.ac.in/~pb/papers/ACL-2006-Hindi-POS-Tagging.pdf)

Projection (http://www.dipanjandas.com/files/posInduction.pdf)

