
Sentence splitting, tokenization, language modeling, and part-of-speech tagging with hidden Markov models

Yoshimasa Tsuruoka

Topics

• Sentence splitting

• Tokenization

• Maximum likelihood estimation (MLE)

• Language models

– Unigram

– Bigram

– Smoothing

• Hidden Markov models (HMMs)

– Part-of-speech tagging

– Viterbi algorithm

[Figure: text processing turns raw (unstructured) text into annotated (structured) text through steps such as part-of-speech tagging, named entity recognition, and deep syntactic parsing, supported by a dictionary and an ontology. The example sentence "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells." is tokenized, tagged with parts of speech (NN IN NN VBZ VBN IN NN IN JJ NN NNS .), annotated with named entities (protein_molecule, organic_compound, cell_line), parsed into phrases (NP, PP, VP, S), and linked by a negative regulation relation.]

Basic Steps of Natural Language Processing

• Sentence splitting

• Tokenization

• Part-of-speech tagging

• Shallow parsing

• Named entity recognition

• Syntactic parsing

• (Semantic Role Labeling)

Sentence splitting

Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines. However, Th1 T-cell pro-inflammatory cytokine production is important in host defense against bacterial infection in the lungs. Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection.

Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines.

However, Th1 T-cell pro-inflammatory cytokine production is important in host defense against bacterial infection in the lungs.

Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection.

A heuristic rule for sentence splitting

sentence boundary

= period + space(s) + capital letter

Regular expression in Perl

s/\. +([A-Z])/\.\n$1/g;
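As a quick usage sketch, the same rule can be applied to a whole file with a Perl one-liner; the file name is only an illustration:

perl -pe 's/\. +([A-Z])/\.\n$1/g' input.txt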

Errors

• The simple rule incorrectly splits after abbreviations such as "e.g.":

IL-33 is known to induce the production of Th2-associated cytokines (e.g. IL-5 and IL-13).

is wrongly split into

IL-33 is known to induce the production of Th2-associated cytokines (e.g.
IL-5 and IL-13).

• Two solutions:

– Add more rules to handle exceptions

– Machine learning

Adding rules to handle exceptions

#!/usr/bin/perl
# Read text from standard input, split sentences after "." or "?",
# then undo splits made after common abbreviations.
while (<STDIN>) {
    # Split after "." or "?" followed by spaces and a letter, digit, or "(".
    $_ =~ s/([\.\?]) +([\(0-9a-zA-Z])/$1\n$2/g;

    # Rejoin lines that were wrongly split after known abbreviations.
    $_ =~ s/(\WDr\.)\n/$1 /g;
    $_ =~ s/(\WMr\.)\n/$1 /g;
    $_ =~ s/(\WMs\.)\n/$1 /g;
    $_ =~ s/(\WMrs\.)\n/$1 /g;
    $_ =~ s/(\Wvs\.)\n/$1 /g;
    $_ =~ s/(\Wa\.m\.)\n/$1 /g;
    $_ =~ s/(\Wp\.m\.)\n/$1 /g;
    $_ =~ s/(\Wi\.e\.)\n/$1 /g;
    $_ =~ s/(\We\.g\.)\n/$1 /g;

    print;
}

Tokenization

• Tokenizing general English sentences is relatively straightforward.

• Use spaces as the boundaries

• Use some heuristics to handle exceptions

``We’re heading into a recession.’’

`` We ’re heading into a recession . ’’

Exceptions

• Separate possessive endings or abbreviated forms from preceding words:

– Mary’s → Mary ’s
– Mary’s → Mary is
– Mary’s → Mary has

• Separate punctuation marks and quotes from words:

– Mary. → Mary .

– “new” → “ new ”
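A minimal Perl sketch of the whitespace-plus-heuristics approach above (this is not the Penn Treebank tokenizer.sed referenced below; the regular expressions are simplified and purely illustrative):

#!/usr/bin/perl
# Rough whitespace-and-heuristics tokenizer (illustrative sketch only).
while (<STDIN>) {
    s/^"/`` /;                   # opening double quote at the start of a line
    s/ "/ `` /g;                 # opening double quotes after a space
    s/"/ '' /g;                  # remaining double quotes are treated as closing
    s/([,;:!?])/ $1 /g;          # split off common punctuation marks
    s/(\S)(\.)( |$)/$1 $2$3/g;   # split off sentence-final periods
    s/'s\b/ 's/g;                # possessive or contracted "is" or "has"
    s/'re\b/ 're/g;              # contracted "are"
    s/n't\b/ n't/g;              # contracted "not"
    s/ +/ /g;                    # squeeze repeated spaces
    print;
}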

Tokenization

• Tokenizer.sed: a simple script in sed

• http://www.cis.upenn.edu/~treebank/tokenization.html

• Undesirable tokenization

– original: “1,25(OH)2D3”

– tokenized: “1 , 25 ( OH ) 2D3”

• Tokenization for biomedical text

– Not straightforward

– Needs a dictionary? Machine learning?

Maximum likelihood estimation

• Learning parameters from data

• Coin flipping example

– We want to know the probability that a biased coin comes up heads.

– Data: {H, T, H, H, T, H, T, T, H, H, T}

P(Head) = 6/11        Why?

Maximum likelihood estimation

• Likelihood

– Data: {H, T, H, H, T, H, T, T, H, H, T}

• Maximize the (log) likelihood

P(\mathrm{Data}) = \theta(1-\theta)\theta\theta(1-\theta)\theta(1-\theta)(1-\theta)\theta\theta(1-\theta) = \theta^{6}(1-\theta)^{5}

\log P(\mathrm{Data}) = 6\log\theta + 5\log(1-\theta)

\frac{d}{d\theta}\log P(\mathrm{Data}) = \frac{6}{\theta} - \frac{5}{1-\theta} = 0

\theta = \frac{6}{11}
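As a tiny illustrative sketch (the data list is just the example above), the MLE is the relative frequency of heads:

#!/usr/bin/perl
use strict;
use warnings;

# Maximum likelihood estimate for a biased coin: count(H) / total flips.
my @data  = qw(H T H H T H T T H H T);
my $heads = grep { $_ eq 'H' } @data;
printf "P(Head) = %d/%d = %.3f\n", $heads, scalar @data, $heads / @data;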

Language models

• A language model assigns a probability to a word sequence

• Statistical machine translation

– Noisy channel models: machine translation, part-of-speech tagging, speech recognition, optical character recognition, etc.

P(e | f) = \frac{P(f | e) P(e)}{P(f)} \propto P(f | e) P(e)

e: English sentence, f: French sentence, P(e) = P(w_1 ... w_n): language model for English

Language modeling

• How do you compute P(w_1 ... w_n)?

Learn from data (a corpus):

He opened the window .      0.0000458
She opened the window .     0.0000723
John was hit .              0.0000244
John hit was .              0.0000002
:                           :

Estimate probabilities from a corpus

Corpus:
Humpty Dumpty sat on a wall .
Humpty Dumpty had a great fall .

P(Humpty Dumpty sat on a wall .) = 1/2

P(Humpty Dumpty had a great fall .) = 1/2

P(Humpty Dumpty had a fall .) = 0

Sequence to words

• Let’s decompose the sequence probability into probabilities of individual words

P(w_1 ... w_n) = P(w_1) P(w_2 ... w_n | w_1)
             = P(w_1) P(w_2 | w_1) P(w_3 ... w_n | w_1 w_2)
             = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) ... P(w_n | w_1 ... w_{n-1})
             = \prod_{i=1}^{n} P(w_i | w_1 ... w_{i-1})

Chain rule: P(x, y) = P(x) P(y | x)

Unigram model

• Ignore the context completely:

P(w_1 ... w_n) = \prod_{i=1}^{n} P(w_i | w_1 ... w_{i-1}) \approx \prod_{i=1}^{n} P(w_i)

• Example:

P(Humpty Dumpty sat on a wall .) \approx P(Humpty) P(Dumpty) P(sat) P(on) P(a) P(wall) P(.)

MLE for unigram models

• Count the frequencies

Corpus:
Humpty Dumpty sat on a wall .
Humpty Dumpty had a great fall .

P(w_k) = \frac{C(w_k)}{\sum_{w} C(w)}

P(Humpty) = 2/14        P(on) = 1/14
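A minimal Perl sketch of this counting, assuming whitespace-tokenized text on standard input (for example, the two lines of the toy corpus above):

#!/usr/bin/perl
use strict;
use warnings;

# Unigram maximum likelihood estimates: P(w) = C(w) / total number of tokens.
my (%count, $total);
while (<STDIN>) {
    for my $w (split) {      # split on whitespace
        $count{$w}++;
        $total++;
    }
}
for my $w (sort keys %count) {
    printf "P(%s) = %d/%d = %.4f\n", $w, $count{$w}, $total, $count{$w} / $total;
}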

Unigram model

• Problem

• Word order is not considered.

P(Humpty Dumpty sat on a wall .) \approx P(Humpty) P(Dumpty) P(sat) P(on) P(a) P(wall) P(.)

P(Dumpty Humpty sat on a wall .) \approx P(Humpty) P(Dumpty) P(sat) P(on) P(a) P(wall) P(.)

Bigram model

• 1st-order Markov assumption:

P(w_1 ... w_n) = \prod_{i=1}^{n} P(w_i | w_1 ... w_{i-1}) \approx \prod_{i=1}^{n} P(w_i | w_{i-1})

• Example:

P(Humpty Dumpty sat on a wall .) \approx P(Humpty | <s>) P(Dumpty | Humpty) P(sat | Dumpty) P(on | sat) P(a | on) P(wall | a) P(. | wall)

MLE for bigram models

• Count frequencies

Corpus:
Humpty Dumpty sat on a wall .
Humpty Dumpty had a great fall .

P(w_k | w_{k-1}) = \frac{C(w_{k-1} w_k)}{C(w_{k-1})}

P(sat | Dumpty) = 1/2        P(had | Dumpty) = 1/2
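A similar Perl sketch for bigram counts, under the same assumptions (whitespace-tokenized sentences on standard input; the <s> sentence-start symbol is an addition of the sketch):

#!/usr/bin/perl
use strict;
use warnings;

# Bigram maximum likelihood estimates:
#   P(w_k | w_{k-1}) = C(w_{k-1} w_k) / C(w_{k-1})
my (%bigram, %hist);
while (<STDIN>) {
    my @words = ('<s>', split);          # prepend a sentence-start symbol
    for my $i (1 .. $#words) {
        $bigram{ $words[$i-1] }{ $words[$i] }++;
        $hist{ $words[$i-1] }++;         # count of w_{k-1} as a history
    }
}
for my $prev (sort keys %bigram) {
    for my $w (sort keys %{ $bigram{$prev} }) {
        printf "P(%s | %s) = %d/%d\n", $w, $prev, $bigram{$prev}{$w}, $hist{$prev};
    }
}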

N-gram models

• (n-1)th-order Markov assumption

• Modeling with a large n

– Pros: rich contextual information

– Cons: sparseness (unreliable probability estimation)

P(w_1 ... w_n) = \prod_{i=1}^{n} P(w_i | w_1 ... w_{i-1}) \approx \prod_{i=1}^{n} P(w_i | w_{i-n+1} ... w_{i-1})

Web 1T 5-gram data

• Released by Google

– http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13

– 1 trillion word tokens of text from Web pages

• Examples

serve as the incoming 92

serve as the incubator 99

serve as the independent 794

serve as the index 223

serve as the indication 72

serve as the indicator 120

: :

Smoothing

• Problems with the maximum likelihood estimation method

– Events that are not observed in the training corpus are given zero probabilities

– Unreliable estimates when the counts are small

Add-One Smoothing

• Bigram counts table

Corpus: Humpty Dumpty sat on a wall .   Humpty Dumpty had a great fall .

Bigram counts C(w_{k-1} w_k) (rows: w_{k-1}, columns: w_k):

Humpty Dumpty sat on a wall had great fall .

Humpty 0 2 0 0 0 0 0 0 0 0

Dumpty 0 0 1 0 0 0 1 0 0 0

sat 0 0 0 1 0 0 0 0 0 0

on 0 0 0 0 1 0 0 0 0 0

a 0 0 0 0 0 1 0 1 0 0

wall 0 0 0 0 0 0 0 0 0 1

had 0 0 0 0 1 0 0 0 0 0

great 0 0 0 0 0 0 0 0 1 0

fall 0 0 0 0 0 0 0 0 0 1

. 0 0 0 0 0 0 0 0 0 0

Add-One Smoothing

• Add one to all counts

Corpus: Humpty Dumpty sat on a wall .   Humpty Dumpty had a great fall .

Counts after adding one to every cell (rows: w_{k-1}, columns: w_k):

Humpty Dumpty sat on a wall had great fall .

Humpty 1 3 1 1 1 1 1 1 1 1

Dumpty 1 1 2 1 1 1 2 1 1 1

sat 1 1 1 2 1 1 1 1 1 1

on 1 1 1 1 2 1 1 1 1 1

a 1 1 1 1 1 2 1 2 1 1

wall 1 1 1 1 1 1 1 1 1 2

had 1 1 1 1 2 1 1 1 1 1

great 1 1 1 1 1 1 1 1 2 1

fall 1 1 1 1 1 1 1 1 1 2

. 1 1 1 1 1 1 1 1 1 1

Add-One Smoothing

• V: the size of the vocabulary

P(w_k | w_{k-1}) = \frac{C(w_{k-1} w_k) + 1}{C(w_{k-1}) + V}

Corpus: Humpty Dumpty sat on a wall .   Humpty Dumpty had a great fall .

P(sat | Dumpty) = \frac{1 + 1}{2 + 10}
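A Perl sketch of add-one smoothing; it repeats the bigram counting so it is self-contained, and it assumes the vocabulary is simply the set of observed word types:

#!/usr/bin/perl
use strict;
use warnings;

# Add-one (Laplace) smoothed bigram estimates:
#   P(w_k | w_{k-1}) = (C(w_{k-1} w_k) + 1) / (C(w_{k-1}) + V)
my (%bigram, %hist, %vocab);
while (<STDIN>) {
    my @words = ('<s>', split);
    for my $i (1 .. $#words) {
        $bigram{ $words[$i-1] }{ $words[$i] }++;
        $hist{ $words[$i-1] }++;
        $vocab{ $words[$i] } = 1;        # vocabulary = observed word types
    }
}
my $V = keys %vocab;                      # vocabulary size

sub smoothed_prob {
    my ($prev, $w) = @_;
    return (($bigram{$prev}{$w} // 0) + 1) / (($hist{$prev} // 0) + $V);
}

# On the toy corpus this prints (1+1)/(2+10) = 0.1667.
printf "P(sat | Dumpty) = %.4f\n", smoothed_prob('Dumpty', 'sat');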

Part-of-speech tagging

• Assigns a part-of-speech tag to each word in the sentence.

• Part-of-speech tags

NN: Noun              IN: Preposition
NNP: Proper noun      VBD: Verb, past tense
DT: Determiner        VBN: Verb, past participle
:                     :

Paul/NNP Krugman/NNP ,/, a/DT professor/NN at/IN Princeton/NNP University/NNP ,/, was/VBD awarded/VBN the/DT Nobel/NNP Prize/NNP in/IN Economics/NNP on/IN Monday/NNP ./.

Part-of-speech tagging is not easy

• Parts-of-speech are often ambiguous

• We need to look at the context

• But how?

I have to go there.   (go = verb)

I had a go at it.     (go = noun)

Writing rules for part-of-speech tagging

• If the previous word is “to”, then it’s a verb.

• If the previous word is “a”, then it’s a noun.

• If the next word is …

:

Writing rules manually is impossible

I have to go there.   (go = verb)

I had a go at it.     (go = noun)

Learning from examples

Tagged training data:

The/DT involvement/NN of/IN ion/NN channels/NNS in/IN B/NN and/CC T/NN lymphocyte/NN activation/NN is/VBZ supported/VBN by/IN many/JJ reports/NNS of/IN changes/NNS in/IN ion/NN fluxes/NNS and/CC membrane/NN ...

The tagged data are used to train a machine learning algorithm, which is then applied to unseen text:

We demonstrate that ...   →   We/PRP demonstrate/VBP that/IN ...

Part-of-speech tagging

• Input:

– Word sequence  W = w_1 ... w_n

• Output (prediction):

– The most probable tag sequence  T = t_1 ... t_n

\hat{T} = \operatorname{argmax}_{T \in \tau} P(T | W) = \operatorname{argmax}_{T \in \tau} \frac{P(T) P(W | T)}{P(W)} = \operatorname{argmax}_{T \in \tau} P(T) P(W | T)

(P(W) is constant with respect to T.)

Assumptions

• Assume the word emission probability depends only on its tag

P(W | T) = P(w_1 ... w_n | t_1 ... t_n)
        = P(w_1 | t_1 ... t_n) P(w_2 ... w_n | w_1, t_1 ... t_n)
        = P(w_1 | t_1 ... t_n) P(w_2 | w_1, t_1 ... t_n) P(w_3 ... w_n | w_1 w_2, t_1 ... t_n)
        = ...
        = \prod_{i=1}^{n} P(w_i | w_1 ... w_{i-1}, t_1 ... t_n)

With this assumption:

P(W | T) \approx \prod_{i=1}^{n} P(w_i | t_i)

Part-of-speech tagging with Hidden Markov Models

• Assume the first-order Markov assumption for tags:

P(T) \approx \prod_{i=1}^{n} P(t_i | t_{i-1})

• Then the most probable tag sequence can be expressed as

\hat{T} = \operatorname{argmax}_{T \in \tau} P(T) P(W | T) = \operatorname{argmax}_{T \in \tau} \prod_{i=1}^{n} P(t_i | t_{i-1}) P(w_i | t_i)

where P(t_i | t_{i-1}) is the transition probability and P(w_i | t_i) is the emission probability.

Learning from training data

The/DT involvement/NN of/IN ion/NN channels/NNS in/IN B/NN and/CC T/NN lymphocyte/NN activation/NN is/VBZ supported/VBN by/IN many/JJ reports/NNS of/IN changes/NNS in/IN ion/NN fluxes/NNS and/CC membrane/NN ...

Maximum likelihood estimation:

transition probability:   P(t_i | t_{i-1}) = \frac{C(t_{i-1} t_i)}{C(t_{i-1})}

emission probability:     P(w_i | t_i) = \frac{C(w_i, t_i)}{C(t_i)}
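A rough Perl sketch of these two estimates, assuming tagged training data in word/TAG format with one sentence per line (as in the example above); the input format and the <s> start symbol are assumptions of the sketch:

#!/usr/bin/perl
use strict;
use warnings;

# Maximum likelihood estimates for an HMM tagger from word/TAG data:
#   transition: P(t_i | t_{i-1}) = C(t_{i-1} t_i) / C(t_{i-1})
#   emission:   P(w_i | t_i)     = C(w_i, t_i)   / C(t_i)
my (%trans, %emit, %tag_count, %hist_count);
while (<STDIN>) {
    my $prev = '<s>';                        # sentence-start state
    for my $pair (split) {
        my ($word, $tag) = split m{/}, $pair, 2;
        $trans{$prev}{$tag}++;
        $hist_count{$prev}++;
        $emit{$tag}{$word}++;
        $tag_count{$tag}++;
        $prev = $tag;
    }
}

sub p_trans { my ($t, $prev) = @_; return ($trans{$prev}{$t} // 0) / $hist_count{$prev}; }
sub p_emit  { my ($w, $t)    = @_; return ($emit{$t}{$w}    // 0) / $tag_count{$t}; }

printf "P(NN | DT) = %.3f\n", p_trans('NN', 'DT') if $hist_count{'DT'};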

Examples from the Wall Street Journal corpus (1)

Transition probabilities P(t_i | t_{i-1}):

P(NN | DT) = 0.47
P(JJ | DT) = 0.22
P(NNP | DT) = 0.11
P(NNS | DT) = 0.07
P(CD | DT) = 0.02

POS Tag Description Examples

DT determiner a, the

NN noun, singular or mass company

NNS noun plural companies

NNP proper nouns, singular Nasdaq

IN preposition/subordinating conjunction in, of, like

VBD verb, past tense took

: : :

P(IN | NN) = 0.25
P(NN | NN) = 0.12
P(, | NNP) = 0.11
P(. | NNP) = 0.08
P(VBD | NNP) = 0.05

Examples from the Wall Street Journal corpus (2)

Emission probabilities P(w_i | t_i):

P(% | NN) = 0.037
P(company | NN) = 0.019
P(year | NN) = 0.0167
P(market | NN) = 0.0155
P(share | NN) = 0.0107


P(said | VBD) = 0.188
P(was | VBD) = 0.131
P(were | VBD) = 0.064
P(had | VBD) = 0.056
P(rose | VBD) = 0.022

Part-of-speech tagging

• Input:

– Word sequence  W = w_1 ... w_n

• Output (prediction):

– The most probable tag sequence  T = t_1 ... t_n

\hat{T} = \operatorname{argmax}_{T \in \tau} \prod_{i=1}^{n} P(t_i | t_{i-1}) P(w_i | t_i)

[Figure: the HMM as a chain of hidden tags t_1, t_2, t_3, ..., t_n, each emitting the corresponding word w_1, w_2, w_3, ..., w_n.]

Finding the best sequence

• Naive approach:

– Enumerate all possible tag sequences with their probabilities and select the one that gives the highest probability

[Figure: each word w_1, w_2, w_3, ..., w_n can take any tag (DT, NN, NNS, NNP, VB, ...), so the number of candidate tag sequences is exponential in the length of the sentence.]

Finding the best sequence

• if you write them down…

P(DT | <s>) P(w_1 | DT) · P(DT | DT) P(w_2 | DT) · ... · P(DT | DT) P(w_n | DT)
P(NN | <s>) P(w_1 | NN) · P(DT | NN) P(w_2 | DT) · ... · P(DT | DT) P(w_n | DT)
P(NNS | <s>) P(w_1 | NNS) · P(DT | NNS) P(w_2 | DT) · ... · P(DT | DT) P(w_n | DT)
:
P(DT | <s>) P(w_1 | DT) · P(NN | DT) P(w_2 | NN) · ... · P(DT | DT) P(w_n | DT)
P(NN | <s>) P(w_1 | NN) · P(NN | NN) P(w_2 | NN) · ... · P(DT | DT) P(w_n | DT)
P(NNS | <s>) P(w_1 | NNS) · P(NN | NNS) P(w_2 | NN) · ... · P(DT | DT) P(w_n | DT)
:

A lot of redundancy in computation

There should be a way to do it more efficiently

Viterbi algorithm (1)

• Let V_k(t_k) be the probability of the best tag sequence for w_1 ... w_k that ends with the tag t_k:

V_k(t_k) = \max_{t_1, ..., t_{k-1}} \prod_{i=1}^{k} P(t_i | t_{i-1}) P(w_i | t_i)
        = \max_{t_{k-1}} [ P(t_k | t_{k-1}) P(w_k | t_k) \max_{t_1, ..., t_{k-2}} \prod_{i=1}^{k-1} P(t_i | t_{i-1}) P(w_i | t_i) ]
        = \max_{t_{k-1}} P(t_k | t_{k-1}) P(w_k | t_k) V_{k-1}(t_{k-1})

We can compute them in a recursive manner!

Viterbi algorithm (2)

V_k(t_k) = \max_{t_{k-1}} P(t_k | t_{k-1}) P(w_k | t_k) V_{k-1}(t_{k-1})

V_0(<s>) = 1   (beginning of the sentence)

[Figure: trellis over the tags DT, NN, NNS, NNP, ... at positions k-1 and k. The candidates for V_k(DT), one per previous tag, are:]

P(DT | DT) P(w_k | DT) V_{k-1}(DT)
P(DT | NN) P(w_k | DT) V_{k-1}(NN)
P(DT | NNS) P(w_k | DT) V_{k-1}(NNS)
P(DT | NNP) P(w_k | DT) V_{k-1}(NNP)
:

Viterbi algorithm (3)

• Left to right

– Save backward pointers to recover the best tag sequence

Complexity: (number of tags)^2 × (sentence length)

[Figure: trellis for the sentence "He has opened it" with the tags Noun and Verb at each position, filled in from left to right while saving backpointers.]
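A compact Perl sketch of the Viterbi recursion defined above. The two-tag tag set mirrors the trellis figure, but the transition and emission probabilities are invented toy numbers, not estimates from any corpus:

#!/usr/bin/perl
use strict;
use warnings;

# Toy transition probabilities P(t_i | t_{i-1}) and emission probabilities
# P(w_i | t_i); the numbers are invented purely for illustration.
my %trans = (
    '<s>' => { Noun => 0.7, Verb => 0.3 },
    Noun  => { Noun => 0.3, Verb => 0.7 },
    Verb  => { Noun => 0.6, Verb => 0.4 },
);
my %emit = (
    Noun => { He => 0.40, has => 0.05, opened => 0.05, it => 0.50 },
    Verb => { He => 0.01, has => 0.50, opened => 0.45, it => 0.04 },
);
my @tags = qw(Noun Verb);

sub viterbi {
    my @words = @_;
    my @V    = ( { '<s>' => 1 } );    # V_0(<s>) = 1
    my @back = ( {} );                # backpointers for recovering the path
    for my $k (1 .. scalar @words) {
        my $w = $words[$k - 1];
        for my $t (@tags) {
            my ($best, $best_prev) = (0, undef);
            for my $prev (keys %{ $V[$k - 1] }) {
                # V_k(t) = max over prev of P(t|prev) * P(w_k|t) * V_{k-1}(prev)
                my $score = $V[$k - 1]{$prev}
                          * ($trans{$prev}{$t} // 0)
                          * ($emit{$t}{$w}     // 0);
                ($best, $best_prev) = ($score, $prev) if $score > $best;
            }
            $V[$k]{$t}    = $best;
            $back[$k]{$t} = $best_prev;
        }
    }
    # Pick the best final tag and follow the backpointers to the left.
    my ($last) = sort { $V[-1]{$b} <=> $V[-1]{$a} } @tags;
    my @path = ($last);
    for (my $k = $#V; $k > 1; $k--) {
        unshift @path, $back[$k]{ $path[0] };
    }
    return @path;
}

print join(' ', viterbi(qw(He has opened it))), "\n";   # prints: Noun Verb Verb Noun

The double loop over the tags at positions k-1 and k gives the (number of tags)^2 × (sentence length) cost noted above.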

Topics

• Sentence splitting

• Tokenization

• Maximum likelihood estimation (MLE)

• Language models

– Unigram

– Bigram

– Smoothing

• Hidden Markov models (HMMs)

– Part-of-speech tagging

– Viterbi algorithm

Recommended