Cs626 Lect1to4 Intro Pos


    Speech, NLP and the Web

    Pushpak Bhattacharyya

    CSE Dept.,

    IIT Bombay

    Lecture 1-4: Introduction, POS

    21 July, 2014


    Basic information

    Slot 4: Mon- 11.30, Tue- 8.30, Thu- 9.30AM

    Venue: F.C. Kohli auditorium

    TA team: Aditya, Geetanjali, Sandeep, Sagar, Naman

    [email protected]

    [email protected] [email protected]

    [email protected]

    [email protected]

    Course notes: http://www.cse.iitb.ac.in/~pb/cs626-2014

No midsem or endsem; assignments and paper reading for new entrants, projects for others


    NLP- a foundation: Noisy Channel Model

Sequence W is transformed into sequence T (channel picture: W → noisy channel → T)

T* = argmax_T P(T|W) = argmax_T P(T)·P(W|T)

W* = argmax_W P(W|T) = argmax_W P(W)·P(T|W)


    5 representative problems

    using noisy channel modeling

    Statistical Spell Checking

Automatic Speech Recognition

Part of Speech Tagging: discussed in detail in subsequent classes

    Probabilistic Parsing

    Statistical Machine Translation


Some general observations

A* = argmax_A [P(A|B)] = argmax_A [P(A)·P(B|A)]

Computing and using P(A) and P(B|A) both need

(i) looking at the internal structures of A and B

(ii) making independence assumptions

(iii) putting together a computation from smaller parts (a minimal illustration in code follows)
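Below is a minimal sketch of this argmax in code; the candidate set, prior P(A) and channel probabilities P(B|A) are made-up illustrative numbers, not values from the course.

```python
# Noisy-channel decoding sketch: pick A* = argmax_A P(A) * P(B|A) for an observed B.
prior = {"right": 0.6, "write": 0.3, "rite": 0.1}     # P(A): language model (illustrative)
likelihood = {                                        # P(B|A): channel model (illustrative)
    ("rait", "right"): 0.5,
    ("rait", "write"): 0.4,
    ("rait", "rite"): 0.6,
}

def decode(observed_b):
    return max(prior, key=lambda a: prior[a] * likelihood.get((observed_b, a), 0.0))

print(decode("rait"))   # 'right': 0.6*0.5 = 0.30 beats 0.3*0.4 and 0.1*0.6
```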


    Corpus

A collection of text, called a corpus, is used for collecting various language data

With annotation: more information, but manual-labour intensive

Practice: label automatically; correct manually

The famous Brown Corpus contains 1 million tagged words.

Switchboard: a very famous corpus; 2,400 conversations, 543 speakers, many US dialects, annotated with orthography and phonetics


    What is NLP

    Branch of AI

2 Goals

Science Goal: Understand the way language operates

Engineering Goal: Build systems that analyse and generate language; reduce the man-machine gap


Perspectivising NLP: Areas of AI and their inter-dependencies

(Diagram: the inter-dependent areas of AI: Search, Vision, Planning, Machine Learning, Knowledge Representation, Logic, Expert Systems, Robotics, NLP.)


NLP: Two pictures

(Diagram 1: NLP alongside Vision and Speech.)

(Diagram 2: the NLP Trinity: Problem (Morph Analysis, Part of Speech Tagging, Parsing, Semantics) x Language (Hindi, Marathi, English, French) x Algorithm (HMM, MEMM, CRF); Statistics and Probability + Knowledge Based approaches.)


NLP Architecture (increasing complexity of processing):

Morphology → POS tagging → Chunking → Parsing → Semantics Extraction → Discourse and Coreference


    A famous sentence (1/2)

Buffalo buffaloes Buffalo buffaloes buffalo buffalo Buffalo buffaloes Buffalo buffaloes buffalo


    A famous sentence (2/2)

Buffalo buffaloes Buffalo buffaloes buffalo buffalo Buffalo buffaloes Buffalo buffaloes buffalo

    Buffalo:

    Animal

    City

    bully


NLP: multilayered, multidimensional

(The architecture stack (Morphology → POS tagging → Chunking → Parsing → Semantics → Discourse and Coreference, with increasing complexity of processing) and the NLP Trinity diagram (Problem x Language x Algorithm) shown together.)


Multilinguality: the Indian situation

Major streams:

Indo-European

Dravidian

Sino-Tibetan

Austro-Asiatic

Some languages are ranked within the top 20 in the world in terms of the populations speaking them

Hindi and Urdu: 5th (~500 million)

Bangla: 7th (~300 million)

Marathi: 14th (~70 million)


NLP architecture and stages of processing: ambiguity at every stage

Phonetics and phonology

Morphology

Lexical Analysis

Syntactic Analysis

Semantic Analysis

Pragmatics

Discourse


Phonetics: processing of speech sound and associated challenges

Homophones: bank (finance) vs. bank (river bank)

Near Homophones: maatraa vs. maatra (Hindi)

Word Boundary: (aajaayenge) aa jaayenge (will come) or aaj aayenge (will come today); I got [ua] plate

His research is in human languages

Disfluency: ah, um, ahem etc.

(Near-homophone trouble) The king of Abu Dhabi expired and there was national mourning for 7 days. Some children were playing in the evening when a person chided them, "Do not play; it is mourning time". The children said, "No, it is evening time and we will play".


NLP Architecture (increasing complexity of processing):

Morphology → POS tagging → Chunking → Parsing → Semantics Extraction → Discourse and Coreference


    Morphology

    Word formation rules from root words

    Nouns: Plural (boy-boys); Gender marking (czar-czarina)

Verbs: Tense (stretch-stretched); Aspect (e.g. perfective sit-had sat); Modality (e.g. request khaanaa-khaaiie)

Crucial first step in NLP

Languages rich in morphology: e.g., Dravidian, Hungarian, Turkish

Languages poor in morphology: Chinese, English

Languages with rich morphology have the advantage of easier processing at higher stages of processing

A task of interest to computer science: Finite State Machines for Word Morphology


    Lexical Analysis

    Dictionary and word properties

dog

noun (lexical property)

takes -s in plural (morph property)

animate (semantic property)

4-legged (-do-)

carnivore (-do-)
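A minimal sketch of such a lexicon entry as a data structure (the property names below are illustrative, not a standard from the course):

```python
# Dictionary entry for "dog" with lexical, morphological and semantic properties.
lexicon = {
    "dog": {
        "lexical": "noun",
        "morph": "takes -s in plural",
        "semantic": ["animate", "4-legged", "carnivore"],
    }
}
print(lexicon["dog"]["semantic"])
```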


    Lexical Disambiguation

Part of Speech Disambiguation

Dog as a noun (animal)

Dog as a verb (to pursue)

Sense Disambiguation

Dog (as animal)

Dog (as a very detestable person)

The chair emphasised the need for adult education

Very common in day-to-day communications

Satellite Channel Ad: Watch what you want, when you want (two senses of watch)

    Ground breaking ceremony/research

    (ToI: 14/1/14) India eradicates polio, says WHO


Technological developments bring in new terms, and additional meanings/nuances for existing terms

Justify, as in "justify the right margin" (word-processing context)

Xeroxed: a new verb

Digital Trace: a new expression

Communifaking: pretending to talk on a mobile when you are actually not

Discomgooglation: anxiety/discomfort at not being able to access the internet

Helicopter Parenting: over-parenting

    Obamagain, Obama care, modinomics


    Ambiguity of Multiwords

    The grandfather kicked the bucket after suffering from cancer.

    This job is a piece of cake

    Put the sweater on

    He is the dark horse of the match

Google translations of the above sentences (the Hindi output is not recoverable in this transcript).


    Ambiguity of Named Entities

Bengali → English: "Government is restless at home." (*) Intended: "Chanchal Sarkar is at home" (the person's name was translated literally)

Amsterdam airport: Baby Changing Room

Hindi → English: "everyday bold world"; actually the name of a Hindi newspaper in Indore

High degree of overlap between NEs and MWEs

Treat differently: transliterate, do not translate


NLP Architecture (increasing complexity of processing):

Morphology → POS tagging → Chunking → Parsing → Semantics Extraction → Discourse and Coreference


Structure

Parse tree for "I like mangoes": (S (NP I) (VP (V like) (NP mangoes)))


    Structural Ambiguity

Scope

1. The old men and women were taken to safe locations: (old (men and women)) vs. ((old men) and women)

2. No smoking areas will allow Hookas inside

Preposition Phrase Attachment

I saw the boy with a telescope (who has the telescope?)

I saw the mountain with a telescope (world knowledge: a mountain cannot be an instrument of seeing)

Very ubiquitous; newspaper headline: "20 years later, BMC pays father 20 lakhs for causing son's death"


    Garden pathing

The only minus possibly was the need to face the audience more and more insightful question answer

    The old man the boat

    The horse raced past the garden fell


NLP Architecture (increasing complexity of processing):

Morphology → POS tagging → Chunking → Parsing → Semantics Extraction → Discourse and Coreference


    Semantic Analysis

    Representation in terms of

Predicate Calculus / Semantic Nets / Frames / Conceptual Dependencies and Scripts

John gave a book to Mary

Give action: Agent: John, Object: Book, Recipient: Mary

Challenge: ambiguity in semantic role labeling

(Eng) Visiting aunts can be a nuisance

(Hin) aapko mujhe mithaai khilaanii padegii (ambiguous in Marathi and Bengali too; not in Dravidian languages)


    Coreference: challenge

Binding of co-referring nouns and pronouns

The monkey ate the banana, because it was hungry

The monkey ate the banana, because it was ripe and sweet

The monkey ate the banana, because it was lunch time


NLP Architecture (increasing complexity of processing):

Morphology → POS tagging → Chunking → Parsing → Semantics Extraction → Discourse and Coreference


    Pragmatics

Very hard problem: model user intention

Tourist (in a hurry, checking out of the hotel, motioning to the service boy): Boy, go upstairs and see if my sandals are under the divan. Do not be late. I just have 15 minutes to catch the train.

Boy (running upstairs and coming back panting): Yes sir, they are there.

    World knowledge

    WHY INDIA NEEDS A SECOND OCTOBER (ToI,2/10/07)


NLP Architecture (increasing complexity of processing):

Morphology → POS tagging → Chunking → Parsing → Semantics Extraction → Discourse and Coreference


    Discourse

Processing of a sequence of sentences. Mother to John:

John, go to school. It is open today. Should you bunk? Father will be very angry.

Ambiguity of "open"

Bunk what? Why will the father be angry?

Complex chain of reasoning and application of world knowledge

Ambiguity of "father": father as parent or father as headmaster


    Complexity of Connected Text

John was returning from school dejected; today was the math test

He couldn't control the class

Teacher shouldn't have made him responsible

After all, he is just a janitor


    Textual Humour (1/2)

1. Teacher (angrily): Did you miss the class yesterday? Student: Not much.

2. A man coming back to his parked car sees the sticker "Parking fine". He goes and thanks the policeman for appreciating his parking skill.

3. John: I got a Jaguar car for my unemployed youngest son. Jack: That's a great exchange!


    Textual Humour (2/2)

    A teacher-student exchange

Teacher: What do you think is the capital of Ethiopia?

    Student: What do you think?

Teacher (angrily): I do not think I know

Student: I do not think I know


Example of Application of Noisy Channel Model: Probabilistic Speech Recognition (Isolated Word) [8]

Problem Definition: Given a sequence of speech signals, identify the words.

    2 steps :

    Segmentation (Word Boundary Detection)

    Identify the word

    Isolated Word Recognition :

    Identify W given SS (speech signal)

W* = argmax_W P(W | SS)


    Identifying the word

P(SS|W) = likelihood, called the phonological model; intuitively more tractable!

P(W) = prior probability, called the language model

W* = argmax_W P(W | SS) = argmax_W P(W)·P(SS | W)

P(W) ≈ (# times W appears in the corpus) / (# words in the corpus)
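A minimal sketch of this relative-frequency estimate of P(W); the toy corpus string is made up for illustration:

```python
# Maximum-likelihood estimate P(W) = count(W) / total number of words in the corpus.
from collections import Counter

corpus = "the sun rises in the east and the sun sets in the west".split()
counts = Counter(corpus)
total = len(corpus)

def p_word(w):
    return counts[w] / total

print(p_word("the"), p_word("sun"))   # 4/13 and 2/13
```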


Ambiguities in the context of P(SS|W) or P(W|SS)

Concerns

Sound → Text ambiguity

whether v/s weather

right v/s write

bought v/s bot

Text → Sound ambiguity

    read (present tense) v/s read (past tense)

    lead (verb) v/s lead (noun)


Primitives

Phonemes (sound)

    Syllables

    ASCII bytes (machine representation)


Phonemes

Standardized by the IPA (International Phonetic Alphabet) convention

    /t/ sound of t in tag

    /d/ sound of d in dog

    /D/ sound of the


    Syllables

Advise (verb) vs. Advice (noun): ad-vise, ad-vice

A syllable consists of:

1. Nucleus

2. Onset

3. Coda


    Pronunciation Dictionary

P(SS|W) = P(t o m ae t o | Word is tomato) = product of arc probabilities

(Pronunciation automaton for "tomato": states s1 … s7 emit t, o, m, then branch to "ae" with probability 0.73 or "aa" with probability 0.27, then t, o, end; all other arcs have probability 1.0.)
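A minimal sketch of the product-of-arc-probabilities computation for the two pronunciations, using the arc values above:

```python
# P(SS|W) as the product of arc probabilities along one path of the pronunciation automaton.
def path_probability(arc_probs):
    p = 1.0
    for prob in arc_probs:
        p *= prob
    return p

# /t o m ae t o/ takes the 0.73 branch; /t o m aa t o/ takes the 0.27 branch.
print(path_probability([1.0, 1.0, 1.0, 0.73, 1.0, 1.0]))  # 0.73
print(path_probability([1.0, 1.0, 1.0, 0.27, 1.0, 1.0]))  # 0.27
```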


    Foundational question

Generative vs. Discriminative


    How are two entities matched?

Entity A and Entity B: Match(A,B)?

    Two entities match iff their parts match Match(Parts(A), Parts(B))

    Two entities match iff their properties match Match(Properties(A), Properties(B))

    Heart of discriminative vs. generative scoring.


    Books, Journals, Proceedings

Main Text(s): Natural Language Understanding: James Allen

    Speech and NLP: Jurafsky and Martin

    Foundations of Statistical NLP: Manning and Schutze

    Other References: Statistical NLP: Charniak

Journals: Computational Linguistics, Natural Language Engineering, AI, AI Magazine, IEEE SMC

Conferences: ACL, EACL, COLING, MT Summit, EMNLP, IJCNLP, HLT, ICON, SIGIR, WWW, ICML, ECML


    Allied Disciplines

Philosophy: Semantics, Meaning of meaning, Logic (syllogism)

Linguistics: Study of Syntax, Lexicon, Lexical Semantics etc.

Probability and Statistics: Corpus Linguistics, Testing of Hypotheses, System Evaluation

Cognitive Science: Computational Models of Language Processing, Language Acquisition

Psychology: Behaviouristic insights into Language Processing, Psychological Models

Brain Science: Language Processing Areas in the Brain

Physics: Information Theory, Entropy, Random Fields

Computer Sc. & Engg.: Systems for NLP


    Day wise schedule (1/4)

Day-1: Introduction: NLP as playground for rule-based and statistical techniques

Before break: Complete NLP architecture, Ambiguity, start of POS tagging

After break: NLTK (open-source Python-based framework of comprehensive NLP tools), POS tagging assignment

Day-2: Shallow parsing

Before break: Morph analysis and synthesis (segmentation, inflection, declension, derivation etc.), Rule-based vs. Statistical NLU comparison with POS tagging as case study, Hidden Markov Model and Viterbi algorithm

After break: POS tagging assignment continued


    Day wise schedule (2/4)

Day-3: Syntactic Parsing

Before break: Parsing, classical and statistical, theory and techniques

After break: Hands-on with probabilistic parser

Day-4: Semantics

Before break: Rule-based NLU: case study of semantic graph generation through Universal Networking Language (UNL)

After break: continue POS tagging and Parsing assignments


    Day wise schedule (3/4)

Day-5: Lexical resources

Before break: Wordnet, ConceptNet, FrameNet, VerbNet etc.

After break: Hands-on with Lexical Resources, NELL, NEIL

Day-6: Information Extraction, Text classification and basic search

Before break: Named Entity Recognition, Text Entailment, Lucene, Nutch etc.

After break: NER hands-on, basic search, Open IE system


    Day wise schedule (4/4)

Day-7: Affective NLP (cognitive and culture-specific NLP)

Before break: Sentiment Analysis, Pragmatics, Intent recognition (Sarcasm, Thwarting), Eye-Tracking

After break: Machine learning techniques with sentiment analysis as target

Day-8: Deep Learning

Before break: Word vectors and embedding, Neural Nets, Neural language models

After break: Discussion on deep learning tool

    Day-9 and 10: Projects and quiz


    Summary

    Both Linguistics and Computation needed

Linguistics is the eye, Computation the body

The methodology Phenomenon → Formalization → Technique → Experimentation → Evaluation → Hypothesis Testing has accorded to NLP the prestige it commands today

Natural-science-like approach

Neither theory building nor data-driven pattern finding can be ignored


    Part of Speech Tagging

    With Hidden Markov Model


NLP Trinity

(Diagram: the NLP Trinity: Problem (Morph Analysis, Part of Speech Tagging, Parsing, Semantics) x Language (Hindi, Marathi, English, French) x Algorithm (HMM, MEMM, CRF).)


Part of Speech Tagging

POS Tagging: attaches to each word in a sentence a part-of-speech tag from a given set of tags called the Tag-Set

Standard Tag-set: Penn Treebank (for English).


    Example

The_DT mechanisms_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, said_VBD Mr._NNP Benton_NNP ._.


    Where does POS tagging fit in

Morphology → POS tagging → Chunking → Parsing → Semantics Extraction → Discourse and Coreference (increasing complexity of processing)


Example to illustrate complexity of POS tagging


    POS tagging is disambiguation

That_F former_J Sri_Lanka_N skipper_N and_F ace_J batsman_N Aravinda_De_Silva_N is_F a_F man_N of_F few_J words_N was_F very_R much_R evident_J on_F Wednesday_N when_F the_F legendary_J batsman_N ,_F who_F has_V always_R let_V his_N bat_N talk_V ,_F struggled_V to_F answer_V a_F barrage_N of_F questions_N at_F a_F function_N to_F promote_V the_F cricket_N league_N in_F the_F city_N ._F

N (noun), V (verb), J (adjective), R (adverb) and F (other, i.e., function words).


POS disambiguation

That_F/N/J (that can be a complementizer (can be put under F), demonstrative (can be put under J) or pronoun (can be put under N))

former_J

Sri_N/J Lanka_N/J (Sri Lanka together qualify the skipper)

skipper_N/V (skipper can be a verb too)

and_F ace_J/N (ace can be both J and N; Nadal served an ace)

batsman_N/J (batsman can be J as it qualifies Aravinda De Silva)

Aravinda_N De_N Silva_N is_F a_F

man_N/V (man can be a verb too, as in "man the boat")

of_F few_J

words_N/V (words can be a verb too, as in "he words his speeches beautifully")


Behaviour of "That"

That man is known by the company he keeps. (Demonstrative)

Man that is known by the company he keeps, gets a good job. (Pronoun)

That man is known by the company he keeps, is a proverb. (Complementizer)

Chaotic systems: systems where a small perturbation in input causes a large change in output


POS disambiguation (contd.)

was_F very_R much_R evident_J on_F Wednesday_N

when_F/N (when can be a relative pronoun (put under N), as in "I know the time when he comes")

the_F legendary_J batsman_N

who_F/N

has_V always_R let_V his_N

bat_N/V

talk_V/N

struggle_V/N

answer_V/N

barrage_N/V

question_N/V

function_N/V

promote_V cricket_N league_N city_N


    Mathematics of POS tagging


    Argmax computation (1/2)

Best tag sequence = T* = argmax P(T|W) = argmax P(T)·P(W|T) (by Bayes' theorem)

P(T) = P(t0=^ t1 t2 … tn+1=.)

= P(t0)·P(t1|t0)·P(t2|t1,t0)·P(t3|t2,t1,t0) … P(tn|tn−1…t0)·P(tn+1|tn…t0)

= P(t0)·P(t1|t0)·P(t2|t1) … P(tn|tn−1)·P(tn+1|tn)

= ∏_{i=0}^{n+1} P(ti|ti−1)   (Bigram Assumption)


    Argmax computation (2/2)

P(W|T) = P(w0|t0…tn+1)·P(w1|w0,t0…tn+1)·P(w2|w1,w0,t0…tn+1) … P(wn|w0…wn−1,t0…tn+1)·P(wn+1|w0…wn,t0…tn+1)

Assumption: A word is determined completely by its tag. This is inspired by speech recognition.

= P(w0|t0)·P(w1|t1) … P(wn+1|tn+1)

= ∏_{i=0}^{n+1} P(wi|ti)   (Lexical Probability Assumption)
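A minimal sketch of scoring one tag sequence with the two assumptions above (bigram transition probabilities times lexical probabilities); the tables are illustrative numbers, not estimates from a tagged corpus:

```python
# Score P(T) * P(W|T) for one candidate tag sequence under the bigram and lexical assumptions.
trans = {("^", "N"): 0.6, ("N", "V"): 0.4, ("V", "."): 0.3}             # P(t_i | t_{i-1})
lex   = {("people", "N"): 1e-3, ("laugh", "V"): 1e-3, (".", "."): 1.0}  # P(w_i | t_i)

def score(words, tags):
    p = 1.0
    for i in range(1, len(tags)):
        p *= trans.get((tags[i - 1], tags[i]), 0.0)   # bigram tag transition
        p *= lex.get((words[i], tags[i]), 0.0)        # lexical (emission) probability
    return p

print(score(["^", "people", "laugh", "."], ["^", "N", "V", "."]))   # 7.2e-08
```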


Generative Model

^_^ People_N Jump_V High_R ._.

(Trellis: the start state ^, then N/V alternatives for each word, ending in '.'; arcs carry bigram (transition) probabilities and states emit words with lexical probabilities.)

This model is called a Generative model. Here words are observed from tags as states. This is similar to an HMM.


Typical POS tagging steps

Implementation of Viterbi: Unigram, Bigram.

    Five Fold Evaluation.

    Per POS Accuracy.

    Confusion Matrix.


(Figure: Per-POS accuracy for the Bigram Assumption; one bar per tag of the tagset (AJ0, AJ0-NN1, AJ0-VVG, AJC, AT0, …, VVZ-NN2), accuracies plotted on a 0 to 1.2 scale.)


Screen shot of typical Confusion Matrix

Rows: gold tag; columns: assigned tag (AJ0, AJ0-AV0, AJ0-NN1, AJ0-VVD, AJ0-VVG, AJ0-VVN, AJC, AJS, AT0, AV0, AV0-AJ0, AVP)

AJ0:      2899  20  32   1   3   3    0    0    18    35   27    1
AJ0-AV0:    31  18   2   0   0   0    0    0     0     1   15    0
AJ0-NN1:   161   0 116   0   0   0    0    0     0     0    1    0
AJ0-VVD:     7   0   0   0   0   0    0    0     0     0    0    0
AJ0-VVG:     8   0   0   0   2   0    0    0     1     0    0    0
AJ0-VVN:     8   0   0   3   0   2    0    0     1     0    0    0
AJC:         2   0   0   0   0   0   69    0     0    11    0    0
AJS:         6   0   0   0   0   0    0   38     0     2    0    0
AT0:       192   0   0   0   0   0    0    0  7000    13    0    0
AV0:       120   8   2   0   0   0   15    2    24  2444   29   11
AV0-AJ0:    10   7   0   0   0   0    0    0     0    16   33    0
AVP:        24   0   0   0   0   0    0    0     1    11    0  737
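A minimal sketch of computing per-POS accuracy from such a confusion matrix (the small matrix below reuses a couple of rows of the table above, just for illustration):

```python
# Per-POS accuracy: for each gold tag, the fraction of its tokens that received that same tag.
confusion = {
    "AJ0": {"AJ0": 2899, "AV0": 35, "AJ0-NN1": 32},
    "AV0": {"AJ0": 120, "AV0": 2444, "AVP": 11},
}

def per_pos_accuracy(matrix):
    return {gold: row.get(gold, 0) / sum(row.values()) for gold, row in matrix.items()}

print(per_pos_accuracy(confusion))   # e.g. AJ0 -> 2899 / (2899 + 35 + 32)
```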


HMM

(The NLP Trinity diagram again: Problem (Morph Analysis, Part of Speech Tagging, Parsing, Semantics) x Language (Hindi, Marathi, English, French) x Algorithm (HMM, MEMM, CRF); HMM is the algorithm taken up now.)


    A Motivating Example

Colored ball choosing

Urn 1: # Red = 30, # Green = 50, # Blue = 20

Urn 2: # Red = 10, # Green = 40, # Blue = 50

Urn 3: # Red = 60, # Green = 10, # Blue = 30


Example (contd.)

Given the transition probability table (rows: current urn; columns: next urn U1 U2 U3):

U1: 0.1, 0.4, 0.5

U2: 0.6, 0.2, 0.2

U3: 0.3, 0.4, 0.3

and the emission probability table (rows: urn; columns: R G B):

U1: 0.3, 0.5, 0.2

U2: 0.1, 0.4, 0.5

U3: 0.6, 0.1, 0.3

Observation: RRGGBRGR

State Sequence: ?? Not so easily computable.
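A minimal sketch of these HMM parameters as Python dictionaries (values copied from the tables above), reused by the algorithm sketches later in these notes:

```python
# Urn HMM: transition probabilities A and emission probabilities B, plus the observation.
states = ["U1", "U2", "U3"]
A = {  # P(next urn | current urn)
    "U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},
    "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
    "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3},
}
B = {  # P(colour | urn)
    "U1": {"R": 0.3, "G": 0.5, "B": 0.2},
    "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
    "U3": {"R": 0.6, "G": 0.1, "B": 0.3},
}
observation = list("RRGGBRGR")
```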


Diagrammatic representation (1/2)

(State-transition diagram over U1, U2, U3 with the transition probabilities of the table above; each state also carries its emission probabilities for R, G, B.)


Diagrammatic representation (2/2)

(The same diagram with each arc labelled by the combined transition x emission probability for R, G and B.)


Classic problems with respect to HMM

1. Given the observation sequence, find the possible state sequences: Viterbi algorithm

2. Given the observation sequence, find its probability: forward/backward algorithm

3. Given the observation sequence, find the HMM parameters: Baum-Welch algorithm


    Illustration of Viterbi

The start and end are important in a sequence.

Subtrees get eliminated due to the Markov Assumption.

POS Tagset: N (noun), V (verb), O (other) [simplified]; ^ (start), . (end) [start & end states]


    Illustration of Viterbi

Lexicon

people: N, V

laugh: N, V

…

Corpora for Training

^ w11_t11 w12_t12 w13_t13 … w1k_1_t1k_1 .

^ w21_t21 w22_t22 w23_t23 … w2k_2_t2k_2 .

…

^ wn1_tn1 wn2_tn2 wn3_tn3 … wnk_n_tnk_n .


Inference

(Partial sequence graph: ^ followed by N/V for each word, ending in '.'.)

Transition probability table (rows: from; columns: to ^ N V O .):

^: 0, 0.6, 0.2, 0.2, 0

N: 0, 0.1, 0.4, 0.3, 0.2

V: 0, 0.3, 0.1, 0.3, 0.3

O: 0, 0.3, 0.2, 0.3, 0.2

.: 1, 0, 0, 0, 0

This transition table will change from language to language due to language divergences.


Lexical Probability Table

Size of this table = (# POS tags in tagset) x (vocabulary size); vocabulary size = # unique words in the corpus

Rows: tag; columns: ε (empty observation for ^ and .), people, laugh, … :

^: 1, 0, 0, …, 0

N: 0, 1x10⁻³, 1x10⁻⁵, …

V: 0, 1x10⁻⁶, 1x10⁻³, …

O: 0, 0, 1x10⁻⁹, …

.: 1, 0, 0, 0, 0


    Inference

New sentence: ^ people laugh .

p(^ N N . | ^ people laugh .) = (0.6 x 0.1) x (0.1 x 1x10⁻³) x (0.2 x 1x10⁻⁵)

(Partial sequence graph: ^ → {N, V} → {N, V} → .)
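A minimal sketch of this kind of sequence scoring in code, multiplying one transition and one lexical factor per step with the table entries of the two previous slides; the grouping of factors here follows the generative model, so treat the printed value as illustrative:

```python
# Score the tag sequence ^ N N . for "^ people laugh ." with transition x lexical factors.
trans = {("^", "N"): 0.6, ("N", "N"): 0.1, ("N", "."): 0.2}
lex = {("^", "^"): 1.0, ("people", "N"): 1e-3, ("laugh", "N"): 1e-5, (".", "."): 1.0}

p = (trans[("^", "N")] * lex[("people", "N")]) \
    * (trans[("N", "N")] * lex[("laugh", "N")]) \
    * (trans[("N", ".")] * lex[(".", ".")])
print(p)   # 1.2e-10
```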


Computational Complexity

If we have to get the probability of each sequence and then find the maximum among them, we would run into an exponential number of computations. If |S| = #states (tags + ^ + .) and |O| = length of sentence (words + ^ + .), then #sequences = |S|^(|O|−2).

But a large number of partial computations can be reused using Dynamic Programming.


Dynamic Programming

(Trellis for "^ people laugh .": from ^, the scores of the first-level nodes are N: 0.6 x 1.0 = 0.6, V: 0.2, O: 0.2.)

Expanding the N node with "people" (lexical probability 10⁻³):

to N: 0.6 x 0.1 x 10⁻³ = 6 x 10⁻⁵

to V: 0.6 x 0.4 x 10⁻³ = 2.4 x 10⁻⁴

to O: 0.6 x 0.3 x 10⁻³ = 1.8 x 10⁻⁴

to .: 0.6 x 0.2 x 10⁻³ = 1.2 x 10⁻⁴

No need to expand N4 and N5 because they will never be part of the winning sequence.


Computational Complexity

Retain only those N / V / O nodes which end in the highest sequence probability.

Now, complexity reduces from |S|^|O| to |S|·|O|.

Here, we followed the Markov assumption of order 1.


Points to ponder wrt HMM and Viterbi


Viterbi Algorithm

Start with the start state.

Keep advancing sequences that are maximum amongst all those ending in the same state.


Viterbi Algorithm

Tree for the sentence: ^ People laugh .

(Tree with per-node scores (0.6), (0.2), (0.2) at the first level; (0.06x10⁻³), (0.24x10⁻³), (0.18x10⁻³) at the second; (0.06x10⁻⁶), (0.02x10⁻⁶), (0.06x10⁻⁶) at the third; zero-probability branches are not expanded.)

Claim: We do not need to draw all the subtrees in the algorithm.


Effect of shifting probability mass

Will a word always be given the same tag? No. Consider the example:

^ people the city with soldiers . (i.e., populate)

^ quickly people the city .

In the first sentence "people" is most likely to be tagged as noun, whereas in the second, probability mass will shift and "people" will be tagged as verb, since it occurs after an adverb.


Tail phenomenon and language phenomenon

Long tail phenomenon: probability is very low but not zero over a large observed sequence.

Language phenomenon: "people", which is predominantly tagged as Noun, displays a long tail phenomenon; "laugh" is predominantly tagged as Verb.


Viterbi phenomenon (Markov process)

(Two competing nodes N1 and N2 for the same state, with accumulated probabilities 6x10⁻⁵ and 6x10⁻⁸, both about to be expanded with "LAUGH".)

In the next step all the probabilities will be multiplied by identical probabilities (lexical and transition). So the children of N2 will have probability less than the children of N1.


What does P(A|B) mean?

P(A|B) = P(B|A) if P(A) = P(B)

P(A|B) means??

Causality?? B causes A??

Sequentiality?? A follows B?


Back to the Urn Example

Here:

S = {U1, U2, U3}; V = {R, G, B}

For observation O = {o1 … on} and state sequence Q = {q1 … qn}

A = the transition probability table (U1: 0.1, 0.4, 0.5; U2: 0.6, 0.2, 0.2; U3: 0.3, 0.4, 0.3)

B = the emission probability table (U1: R 0.3, G 0.5, B 0.2; U2: R 0.1, G 0.4, B 0.5; U3: R 0.6, G 0.1, B 0.3)

π_i = P(q1 = U_i)  (initial state probabilities)


    Observations and states

    O1 O2 O3 O4 O5 O6 O7 O8

    OBS: R R G G B R G R

    State: S1 S2 S3 S4 S5 S6 S7 S8

    Si = U1/U2/U3; A particular state

    S: State sequence

    O: Observation sequence

    S* = best possible state (urn) sequence

    Goal: Maximize P(S*|O) by choosing best S


Goal

Maximize P(S|O), where S is the State Sequence and O is the Observation Sequence:

S* = argmax_S P(S|O)


    False Start

P(S|O) = P(S1…S8 | O1…O8)

= P(S1|O)·P(S2|S1,O)·P(S3|S2,S1,O) … P(S8|S7…S1,O)

By Markov Assumption (a state depends only on the previous state):

P(S|O) = P(S1|O)·P(S2|S1,O)·P(S3|S2,O) … P(S8|S7,O)

Obs:   R  R  G  G  B  R  G  R   (O1 … O8)

State: S1 S2 S3 S4 S5 S6 S7 S8


Bayes' Theorem: P(A|B) = P(A)·P(B|A) / P(B)

P(A): Prior; P(B|A): Likelihood

argmax_S P(S|O) = argmax_S P(S)·P(O|S)

    POS 96

  • 7/27/2019 Cs626 Lect1to4 Intro Pos

    97/131

    State Transitions Probability

    )|()...|().|().|().()(

    )()(

    718314213121

    81

    SSPSSPSSPSSPSPSP

    SPSP

    By Markov Assumption (k=1)

    )|()...|().|().|().()( 783423121 SSPSSPSSPSSPSPSP


Observation Sequence probability

P(O|S) = P(O1|S1…S8)·P(O2|O1,S1…S8)·P(O3|O1,O2,S1…S8) … P(O8|O1…O7,S1…S8)

Assumption: the ball drawn depends only on the urn chosen:

P(O|S) = P(O1|S1)·P(O2|S2)·P(O3|S3) … P(O8|S8)

P(S|O) ∝ P(S)·P(O|S) = [P(S1)·P(S2|S1)·P(S3|S2) … P(S8|S7)] · [P(O1|S1)·P(O2|S2) … P(O8|S8)]


    Grouping terms

    P(S).P(O|S)

    = [P(O0|S0).P(S1|S0)].

    [P(O1|S1). P(S2|S1)].[P(O2|S2). P(S3|S2)].

    [P(O3|S3).P(S4|S3)].

    [P(O4|S4).P(S5|S4)].

    [P(O5|S5).P(S6|S5)].

    [P(O6|S6).P(S7|S6)].

    [P(O7|S7).P(S8|S7)].

    [P(O8|S8).P(S9|S8)].

We introduce the states S0 and S9 as initial and final states respectively.

After S8 the next state is S9 with probability 1, i.e., P(S9|S8) = 1. O0 is the ε-transition.

    O0 O1 O2 O3 O4 O5 O6 O7 O8

    Obs: R R G G B R G R

    State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9


Introducing useful notation

(The sequence S0 → S1 → … → S8 → S9 with observations ε, R, R, G, G, B, R, G, R on the arcs.)

P(Ok|Sk)·P(Sk+1|Sk) = P(Sk → Sk+1 on symbol Ok), i.e., the probability of the arc from Sk to Sk+1 labelled Ok.


Probabilistic FSM

(Two states S1 and S2; arcs labelled with (symbol : probability) pairs: (a1:0.3), (a2:0.4), (a1:0.2), (a2:0.3), (a1:0.1), (a2:0.2), (a1:0.3), (a2:0.2).)

The question here is: what is the most likely state sequence given the output sequence seen?


Developing the tree

(Tree over the output a1, a2 starting from Start with initial scores S1 = 1.0, S2 = 0.0. After a1 the surviving scores are 1 x 0.1 = 0.1 and 0.3 (the 0.0 branches die). After a2 the candidate scores are 0.1 x 0.2 = 0.02, 0.1 x 0.4 = 0.04, 0.3 x 0.3 = 0.09, 0.3 x 0.2 = 0.06.)

Choose the winning sequence per state per iteration.


Tree structure contd.

(Continuing the tree for the next a1, a2: from the winners 0.09 and 0.06 the candidate scores are 0.09 x 0.1 = 0.009, 0.018, 0.027, 0.012; and at the last step 0.0081, 0.0054, 0.0048, 0.0024.)

The problem being addressed by this tree is S* = argmax_S P(S | a1 a2 a1 a2, μ), where a1-a2-a1-a2 is the output sequence and μ the model or the machine.


Path found (working backward): S1, S2, S1, S2, S1 with outputs a1, a2, a1, a2

Problem statement: Find the best possible sequence S* = argmax_S P(S | O, μ)

where S = state sequence, O = output sequence, μ = model or machine

Model or machine μ = {S0, S, A, T}: start symbol, state collection, alphabet set, transitions

T is defined as P(Si --ak--> Sj) for all i, j, k


Tabular representation of the tree

Columns: latest symbol observed (a1, a2, a1, a2); rows: ending state. Every cell records the winning probability of a sequence ending in that state.

S1: 1.0 | (1.0x0.1, 0.0x0.2) = (0.1, 0.0) | (0.02, 0.09) | (0.009, 0.012) | (0.0024, 0.0081)

S2: 0.0 | (1.0x0.3, 0.0x0.3) = (0.3, 0.0) | (0.04, 0.06) | (0.027, 0.018) | (0.0048, 0.0054)

The bold-faced value in each cell shows the sequence probability ending in that state. Going backward from the final winner sequence, which ends in state S2 (indicated by the 2nd tuple element), we recover the sequence.


Algorithm (following James Allen, Natural Language Understanding (2nd edition), Benjamin Cummings (pub.), 1995)

Given:

1. The HMM, which means:

a. Start State: S1

b. Alphabet: A = {a1, a2, … ap}

c. Set of States: S = {S1, S2, … Sn}

d. Transition probability P(Si --ak--> Sj), which is equal to P(Sj, ak | Si)

2. The output string a1 a2 … aT

To find:

The most likely sequence of states C1 C2 … CT which produces the given output sequence, i.e., C1 C2 … CT = argmax_C [P(C | a1, a2, …, aT)]


    Algorithm contd

    Data Structure:1. A N*T array called SEQSCORE to maintain the

    winner sequence always (N=#states, T=length ofo/p sequence)

    2. Another N*T array called BACKPTR to recover thepath.

    Three distinct steps in the Viterbi implementation1.

    Initialization2. Iteration

    3. Sequence Identification


1. Initialization

SEQSCORE(1,1) = 1.0

BACKPTR(1,1) = 0

For (i = 2 to N) do

    SEQSCORE(i,1) = 0.0

[expressing the fact that the first state is S1]

2. Iteration

For (t = 2 to T) do

    For (i = 1 to N) do

        SEQSCORE(i,t) = Max over j = 1..N of [SEQSCORE(j, t−1) * P(Sj --ak--> Si)]

        BACKPTR(i,t) = the index j that gives the MAX above


3. Sequence Identification

C(T) = i that maximizes SEQSCORE(i,T)

For i from (T−1) to 1 do

    C(i) = BACKPTR[C(i+1), (i+1)]

Optimizations possible:

1. BACKPTR can be 1*T

2. SEQSCORE can be T*2

Homework: compare this with A* and Beam Search. Reason for this comparison: both of them work for finding and recovering a sequence.
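A minimal sketch of the whole procedure in Python for the POS example used earlier; the state set and probability tables are the illustrative ones from those slides, not trained values, and the dictionaries below play the role of SEQSCORE and BACKPTR:

```python
# Viterbi: initialization, iteration (max over predecessors), sequence identification.
def viterbi(words, states, trans, lex, start="^"):
    seqscore = [{start: 1.0}]            # best probability of a sequence ending in each state
    backptr = [{}]                       # back-pointer to the best predecessor state
    for t in range(1, len(words)):
        seqscore.append({})
        backptr.append({})
        for s in states:
            best_prev, best_p = None, 0.0
            for prev, p_prev in seqscore[t - 1].items():
                p = p_prev * trans.get((prev, s), 0.0) * lex.get((words[t], s), 0.0)
                if p > best_p:
                    best_prev, best_p = prev, p
            seqscore[t][s] = best_p
            backptr[t][s] = best_prev
    last = max(seqscore[-1], key=seqscore[-1].get)     # sequence identification
    seq = [last]
    for t in range(len(words) - 1, 0, -1):
        seq.append(backptr[t][seq[-1]])
    return list(reversed(seq)), seqscore[-1][last]

states = ["N", "V", "O", "."]
trans = {("^", "N"): 0.6, ("^", "V"): 0.2, ("^", "O"): 0.2,
         ("N", "N"): 0.1, ("N", "V"): 0.4, ("N", "O"): 0.3, ("N", "."): 0.2,
         ("V", "N"): 0.3, ("V", "V"): 0.1, ("V", "O"): 0.3, ("V", "."): 0.3,
         ("O", "N"): 0.3, ("O", "V"): 0.2, ("O", "O"): 0.3, ("O", "."): 0.2}
lex = {("people", "N"): 1e-3, ("people", "V"): 1e-6,
       ("laugh", "N"): 1e-5, ("laugh", "V"): 1e-3, ("laugh", "O"): 1e-9,
       (".", "."): 1.0}
print(viterbi(["^", "people", "laugh", "."], states, trans, lex))   # (['^','N','V','.'], 7.2e-08)
```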


Viterbi Algorithm for the Urn problem (first two symbols)

(Trellis: from S0 the initial scores of U1, U2, U3 are 0.5, 0.3, 0.2. After reading R: 0.03, 0.08, 0.15. Expanding further on the second R, the candidate scores include 0.06, 0.02, 0.02, 0.18, 0.24, 0.18, and the per-state values are 0.015, 0.04, 0.075*, 0.018, 0.006, 0.006, 0.048*, 0.036; the starred values are the retained winners.)


Markov process of order > 1 (say 2)

The same theory works.

    P(S).P(O|S)

    = P(O0|S0).P(S1|S0).[P(O1|S1). P(S2|S1S0)].

    [P(O2|S2). P(S3|S2S1)].

    [P(O3|S3).P(S4|S3S2)].

    [P(O4|S4).P(S5|S4S3)].

    [P(O5|S5).P(S6|S5S4)].

    [P(O6|S6).P(S7|S6S5)].

    [P(O7|S7).P(S8|S7S6)].

    [P(O8|S8).P(S9|S8S7)].

We introduce the states S0 and S9 as initial and final states respectively.

After S8 the next state is S9 with probability 1, i.e., P(S9|S8,S7) = 1. O0 is the ε-transition.

    O0 O1 O2 O3 O4 O5 O6 O7 O8

    Obs: R R G G B R G R

    State: S0 S1 S2 S3 S4 S5 S6 S7 S8 S9


Probability of observation sequence


Why probability of observation sequence?: the language modeling problem

1. P(The sun rises in the east)

2. P(The sun rise in the east): less probable because of grammatical mistake.

3. P(The svn rises in the east): less probable because of lexical mistake.

4. P(The sun rises in the west): less probable because of semantic mistake.

Probabilities computed in the context of corpora.


Uses of language model

1. Detect well-formedness: lexical, syntactic, semantic, pragmatic, discourse

2. Language identification: given a piece of text, what language does it belong to?

Good morning - English

Guten Morgen - German

Bonjour - French

3. Automatic speech recognition

4. Machine translation


How to compute P(o0 o1 o2 o3 … om)?

P(O) = Σ_S P(O|S)·P(S)   (marginalization over state sequences S)

Consider the observation sequence O0 O1 O2 … Om with an underlying state sequence S0 S1 S2 … Sm+1, where the Si's represent the state sequences.


Computing P(o0 o1 o2 o3 … om)

P(O, S) = P(S)·P(O|S)

= P(S0 S1 … Sm+1)·P(O0 O1 … Om | S0 S1 … Sm+1)

= [P(S0)·P(S1|S0)·P(S2|S1) … P(Sm+1|Sm)] · [P(O0|S0)·P(O1|S1) … P(Om|Sm)]

= P(S0)·[P(O0|S0)·P(S1|S0)]·[P(O1|S1)·P(S2|S1)] … [P(Om|Sm)·P(Sm+1|Sm)]


    Forward and BackwardProbability Calculation


Forward probability F(k,i)

Define F(k,i) = probability of being in state Si having seen o0 o1 o2 … ok

F(k,i) = P(o0 o1 o2 … ok, Si)

With m as the length of the observed sequence and N states,

P(observed sequence) = P(o0 o1 o2 … om) = Σ_{p=0..N} P(o0 o1 o2 … om, Sp) = Σ_{p=0..N} F(m, p)


Forward probability (contd.)

F(k, q) = P(o0 o1 … ok, Sq)

= P(o0 o1 … ok−1, ok, Sq)

= Σ_{p=0..N} P(o0 o1 … ok−1, Sp, ok, Sq)

= Σ_{p=0..N} P(o0 o1 … ok−1, Sp) · P(ok, Sq | o0 o1 … ok−1, Sp)

= Σ_{p=0..N} F(k−1, p) · P(ok, Sq | Sp)

= Σ_{p=0..N} F(k−1, p) · P(Sp → Sq on ok)

(Timeline: observations O0 O1 O2 O3 … Ok Ok+1 … Om over states S0 S1 S2 S3 … Sp Sq … Sfinal.)
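A minimal sketch of this recursion in code, on the urn HMM defined earlier; the slides do not give initial state probabilities, so a uniform start over the urns is assumed here:

```python
# Forward algorithm: F(k, q) = sum_p F(k-1, p) * P(o_k | S_p) * P(S_q | S_p),
# following the arc convention of the slides (the observation is emitted on leaving S_p).
A = {"U1": {"U1": 0.1, "U2": 0.4, "U3": 0.5},
     "U2": {"U1": 0.6, "U2": 0.2, "U3": 0.2},
     "U3": {"U1": 0.3, "U2": 0.4, "U3": 0.3}}
B = {"U1": {"R": 0.3, "G": 0.5, "B": 0.2},
     "U2": {"R": 0.1, "G": 0.4, "B": 0.5},
     "U3": {"R": 0.6, "G": 0.1, "B": 0.3}}
pi = {"U1": 1 / 3, "U2": 1 / 3, "U3": 1 / 3}   # assumed uniform initial distribution

def forward(obs):
    F = dict(pi)                                # distribution over the current state
    for o in obs:
        F = {q: sum(F[p] * B[p][o] * A[p][q] for p in F) for q in A}
    return sum(F.values())                      # P(observation sequence)

print(forward(list("RRGGBRGR")))
```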


Backward probability B(k,i)

Define B(k,i) = probability of seeing ok ok+1 ok+2 … om given that the state was Si

B(k,i) = P(ok ok+1 ok+2 … om | Si)

With m as the length of the whole observed sequence,

P(observed sequence) = P(o0 o1 o2 … om) = P(o0 o1 o2 … om | S0) = B(0,0)


Backward probability (contd.)

B(k, p) = P(ok ok+1 ok+2 … om | Sp)

= P(ok+1 ok+2 … om, ok | Sp)

= Σ_{q=0..N} P(ok+1 ok+2 … om, ok, Sq | Sp)

= Σ_{q=0..N} P(ok, Sq | Sp) · P(ok+1 ok+2 … om | ok, Sq, Sp)

= Σ_{q=0..N} P(ok+1 ok+2 … om | Sq) · P(ok, Sq | Sp)

= Σ_{q=0..N} B(k+1, q) · P(Sp → Sq on ok)

(Timeline: observations O0 O1 O2 O3 … Ok Ok+1 … Om over states S0 S1 S2 S3 … Sp Sq … Sfinal.)


    How Forward Probability

    Works Goal of Forward Probability: To find P(O)

    [the probability of Observation Sequence].

    E.g. ^ People laugh .

    ^ .

    N N

    V V


Transition and Lexical Probability Tables

Transition probabilities (rows: from; columns: ^ N V .):

^: 0, 0.7, 0.3, 0

N: 0, 0.2, 0.6, 0.2

V: 0, 0.6, 0.2, 0.2

.: 1, 0, 0, 0

Lexical probabilities (rows: tag; columns: ε, People, Laugh):

^: 1, 0, 0

N: 0, 0.8, 0.2

V: 0, 0.1, 0.9

.: 1, 0, 0

Inefficient computation: P(O) = Σ over all state sequences S of the path probability ∏ P(Si → Si+1 on oi), i.e., enumerate every path.


    Computation in various paths

    of the Tree PeopleLaugh

    Path 1: ^ N N.

    P(Path1) = (1.0x0.7)x(0.8x0.2)x(0.2x0.2) People

    LaughPath 2: ^ N V

    .

    P(Path2) = (1.0x0.7)x(0.8x0.6)x(0.9x0.2) People

    LaughPath 3: ^ V N

    .P(Path3) = (1.0x0.3)x(0.1x0.6)x(0.2x0.2)

    PeopleLaughPath 4: ^ V V

    .P(Path4) = (1.0x0.3)x(0.1x0.2)x(0.9x0.2)

    ^

    V

    N

    V

    N

    V

    N

    .

    .

    .

    .

    People Laugh
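A minimal sketch that reproduces these four path probabilities and sums them to get P(^ people laugh .), using the transition and lexical tables of the previous slide:

```python
# Brute-force P(O): enumerate the four tag paths and sum their probabilities.
from itertools import product

trans = {("^", "N"): 0.7, ("^", "V"): 0.3,
         ("N", "N"): 0.2, ("N", "V"): 0.6, ("N", "."): 0.2,
         ("V", "N"): 0.6, ("V", "V"): 0.2, ("V", "."): 0.2}
lex = {("^", "^"): 1.0, ("people", "N"): 0.8, ("people", "V"): 0.1,
       ("laugh", "N"): 0.2, ("laugh", "V"): 0.9}
words = ["^", "people", "laugh", "."]

total = 0.0
for tags in product("NV", repeat=2):          # the four paths ^ ? ? .
    seq = ["^", *tags, "."]
    p = 1.0
    for i in range(len(words) - 1):
        # arc from seq[i] to seq[i+1]; words[i] is emitted from the source state
        p *= lex[(words[i], seq[i])] * trans[(seq[i], seq[i + 1])]
    print(seq, p)
    total += p
print("P(O) =", total)
```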


Computations on the Trellis

F = accumulated F x output probability x transition probability

First stage ("People"): the two accumulated values are 0.7 x 1.0 and 0.3 x 1.0.

Later stages combine the accumulated values with the arc factors (0.2x0.3) + (0.6x0.1), (0.6x0.8) + (0.2x0.1), and (0.2x0.2) + (0.2x0.9); the accumulated-F symbols on the original slide did not survive extraction.

(Trellis: ^, then N/V columns for "People" and "Laugh", then '.'.)


Number of Multiplications

Tree: each path has 5 multiplications; there are 4 paths in the tree; therefore a total of 20 multiplications and 3 additions.

Trellis: the two first-stage values take 1 multiplication each; each later value = (accumulated F x arc) + (accumulated F x arc) = 4 multiplications + 1 addition; similarly for the remaining two values, 4 multiplications and 1 addition each. So, a total of 14 multiplications and 3 additions.


Complexity

Let |S| = #states and |O| = observation length − |{^, .}|

Stage 1 of the trellis: |S| multiplications

Stage 2 of the trellis: |S| nodes; each node needs computation over |S| arcs; each arc = 1 multiplication; the accumulated F = 1 more multiplication; total 2|S|² multiplications

Same for each stage before reading '.'

At the final stage ('.'): 2|S| multiplications

Therefore, total multiplications = |S| + 2|S|²(|O| − 1) + 2|S|


Summary: Forward Algorithm

1. Accumulate F over each stage of the trellis.

2. Take the sum of the F values at the last stage (multiplied by the arc probabilities into the final '.' state).

3. Complexity = |S| + 2|S|²(|O| − 1) + 2|S| = 2|S|²|O| − 2|S|² + 3|S| = O(|S|²·|O|), i.e., linear in the length of the input and quadratic in the number of states.


Exercise

1. Backward Probability

a) Derive the Backward Algorithm.

b) Compute its complexity.

2. Express P(O) in terms of both Forward and Backward probability.


Possible project topics (will keep adding)

Scrabble: auto-completion of words (human vs. m/c)

Humour detection using wordnet (incongruity theory)

Multistage POS tagging


    Reading List


TnT (http://www.aclweb.org/anthology-new/A/A00/A00-1031.pdf)

Brill Tagger (http://delivery.acm.org/10.1145/1080000/1075553/p112-brill.pdf?ip=182.19.16.71&acc=OPEN&CFID=129797466&CFTOKEN=72601926&__acm__=1342975719_082233e0ca9b5d1d67a9997c03a649d1)

Hindi POS Tagger built by IIT Bombay (http://www.cse.iitb.ac.in/pb/papers/ACL-2006-Hindi-POS-Tagging.pdf)

Projection (http://www.dipanjandas.com/files/posInduction.pdf)