Named Entity Tagging Thanks to Dan Jurafsky, Jim Martin, Ray Mooney, Tom Mitchell for slides


Outline

Named Entities and the basic idea
IOB Tagging
A new classifier: Logistic Regression
  • Linear regression
  • Logistic regression
  • Multinomial logistic regression = MaxEnt
Why classifiers aren’t as good as sequence models
A new sequence model:
  • MEMM = Maximum Entropy Markov Model

Named Entity Tagging

Slide from Jim Martin

CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.

Named Entity Tagging

Named entities are marked in brackets:

[CHICAGO] (AP) — Citing high fuel prices, [United Airlines] said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. [American Airlines], a unit of [AMR], immediately matched the move, spokesman [Tim Wagner] said. [United], a unit of [UAL], said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as [Chicago] to [Dallas] and [Atlanta] and [Denver] to [San Francisco], [Los Angeles] and [New York].


Named Entity Recognition

Find the named entities and classify them by type
Typical approach
  • Acquire training data
  • Encode using IOB labeling
  • Train a sequential supervised classifier
  • Augment with pre- and post-processing using available list resources (census data, gazetteers, etc.)


Temporal and Numerical Expressions

Temporals
  • Find all the temporal expressions
  • Normalize them based on some reference point
Numerical Expressions
  • Find all the expressions
  • Classify by type
  • Normalize


NE Types


NE Types: Examples


Ambiguity


Biomedical Entities

  • Disease
  • Symptom
  • Drug
  • Body Part
  • Treatment
  • Enzyme
  • Protein
Difficulty: discontiguous or overlapping mentions
  • Abdomen is soft, nontender, nondistended, negative bruits

NER Approaches

As with partial parsing and chunking there are two basic approaches (and hybrids)
Rule-based (regular expressions)
  • Lists of names
  • Patterns to match things that look like names
  • Patterns to match the environments that classes of names tend to occur in
ML-based approaches
  • Get annotated training data
  • Extract features
  • Train systems to replicate the annotation


ML Approach


Encoding for Sequence Labeling

We can use IOB encoding:

  …  United  Airlines  said  Friday  it  has  increased
     B_ORG   I_ORG     O     O       O   O    O

  the  move  ,  spokesman  Tim    Wagner  said.
  O    O     O  O          B_PER  I_PER   O

How many tags?
  For N classes we have 2*N+1 tags
  • An I and B for each class and one O for no-class
Each token in a text gets a tag
Can use simpler IO tagging if what?
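
The encoding above can be sketched in a few lines of Python. The function names `iob_encode` and `tagset_size` and the `(start, end, label)` span format are illustrative assumptions, not something the slides define.

```python
def iob_encode(tokens, spans):
    """Return one IOB tag per token.

    spans: list of (start, end, label) with `end` exclusive (assumed format).
    """
    tags = ["O"] * len(tokens)          # every token starts as "outside"
    for start, end, label in spans:
        tags[start] = "B_" + label      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I_" + label      # remaining tokens of the entity
    return tags


def tagset_size(num_classes):
    # One B and one I tag per class, plus a single O tag.
    return 2 * num_classes + 1


tokens = ["United", "Airlines", "said", "Friday", "it", "has", "increased"]
tags = iob_encode(tokens, [(0, 2, "ORG")])
```

With the single ORG span covering the first two tokens, `tags` reproduces the first tag row shown above.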

NER Features


How to do NE tagging?

Classifiers
  • Naïve Bayes
  • Logistic Regression
Sequence Models
  • HMMs
  • MEMMs
  • CRFs
Sequence models work better

Linear Regression

Example from Freakonomics (Levitt and Dubner 2005)
  • Fantastic/cute/charming versus granite/maple
Can we predict price from the number of adjectives?

Linear Regression

Multiple Linear Regression

Predicting values:

In general:

Let’s pretend an extra “intercept” feature f0 with value 1
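
The slide equations did not survive extraction. As a sketch under the usual textbook formulation (an assumption here), with weights $w_i$, feature values $f_i$, and the intercept feature $f_0 = 1$, the prediction can be written as:

```latex
y \;=\; w_0 + \sum_{i=1}^{N} w_i f_i
  \;=\; \sum_{i=0}^{N} w_i f_i \quad (\text{with } f_0 = 1)
  \;=\; w \cdot f
```

Folding the intercept into the sum is what lets the whole model be written as a single dot product.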

Multiple Linear Regression

Learning in Linear Regression

Consider one instance $x_j$
We’d like to choose weights to minimize the difference between predicted and observed value for $x_j$:

This is an optimization problem that turns out to have a closed-form solution
  • Put the feature vectors $f^{(i)}$ of the training observations into a matrix $X$, one row per observation
  • Put the observed values in a vector $y$
  • Formula that minimizes the cost:

    $W = (X^T X)^{-1} X^T y$
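
A minimal sketch of the closed-form solution, assuming the cost is the sum of squared differences between predicted and observed values (the standard choice, not stated on the surviving slide text). To stay dependency-free it handles only one feature plus the intercept, so $X^T X$ is a 2x2 matrix that can be inverted by hand. The function name and the data are made up for illustration.

```python
def fit_normal_equation(xs, ys):
    """Solve W = (X^T X)^{-1} X^T y where each row of X is [1, x_j]."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[n, sx], [sx, sxx]] and X^T y = [sy, sxy]
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("X^T X is singular; need at least two distinct x values")
    # Multiply the closed-form inverse of the 2x2 matrix by X^T y.
    w0 = (sxx * sy - sx * sxy) / det
    w1 = (n * sxy - sx * sy) / det
    return w0, w1


# Made-up data lying exactly on the line y = 3 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0, 9.0]
w0, w1 = fit_normal_equation(xs, ys)
```

Because the toy data lie exactly on a line, the recovered intercept and slope match the generating values up to floating-point error. With more features, a linear-algebra library would normally replace the hand-written inverse.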

Logistic Regression

But in these language problems we are doing classification
  • Predicting one of a small set of discrete values
Could we just use linear regression for this?

Logistic regression

Not possible: the result doesn’t fall between 0 and 1
Instead of predicting prob, predict ratio of probs:

  • but still not good: the ratio lies between 0 and ∞, while a linear function can take any real value
So how about if we predict the log:

Logistic regression

Solving this for p(y=true)

Logistic function
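
The equations for this step were lost in extraction. Under the standard derivation (assumed here, with $w \cdot f$ denoting the weighted feature sum), setting the log of the odds equal to the linear function and solving for the probability gives the logistic function:

```latex
\ln\frac{p(y=\text{true}\mid x)}{1 - p(y=\text{true}\mid x)} = w \cdot f
\;\Longrightarrow\;
\frac{p(y=\text{true}\mid x)}{1 - p(y=\text{true}\mid x)} = e^{w \cdot f}
\;\Longrightarrow\;
p(y=\text{true}\mid x) = \frac{e^{w \cdot f}}{1 + e^{w \cdot f}} = \frac{1}{1 + e^{-w \cdot f}}
```

The last form always lies strictly between 0 and 1, which is what the earlier attempts were missing.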

Logistic Regression

How do we do classification?

Or:

Or back to explicit sum notation:
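
As a hedged sketch of the decision rule: choosing "true" when p(y=true) exceeds p(y=false) is the same as p(y=true) > 0.5, which in turn is the same as the weighted sum being positive. The function names and the weights below are hypothetical.

```python
import math


def p_true(weights, features):
    """Logistic function applied to the weighted feature sum."""
    z = sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def classify(weights, features):
    # p(true) > p(false)  <=>  p(true) > 0.5  <=>  w . f > 0
    return sum(w * f for w, f in zip(weights, features)) > 0


weights = [-1.0, 2.0]              # made-up weights; first one is the intercept
positive = [1.0, 1.0]              # w . f = 1.0
negative = [1.0, 0.25]             # w . f = -0.5
decisions = (classify(weights, positive), classify(weights, negative))
```

Checking the sign of the sum avoids computing the exponential at all when only the class label is needed.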

Multinomial logistic regression

Multiple classes:

One change: indicator functions f(c,x) instead of real values
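
A minimal sketch of the multi-class case, assuming the usual form in which p(c|x) is proportional to the exponential of the weighted sum of indicator features f_i(c,x), normalized over all classes. The two indicator features, their weights, and the class names are hypothetical.

```python
import math


def maxent_probs(classes, weights, feature_fns, x):
    """p(c | x) proportional to exp(sum_i w_i * f_i(c, x))."""
    scores = {
        c: sum(w * f(c, x) for w, f in zip(weights, feature_fns))
        for c in classes
    }
    z = sum(math.exp(s) for s in scores.values())   # normalizing constant
    return {c: math.exp(s) / z for c, s in scores.items()}


# Hypothetical indicator features: each fires (value 1) only for one class
# combined with one property of the observation.
feature_fns = [
    lambda c, x: 1 if c == "PER" and x["word"][0].isupper() else 0,
    lambda c, x: 1 if c == "O" and x["word"].islower() else 0,
]
weights = [1.5, 2.0]   # made-up weights
probs = maxent_probs(["PER", "ORG", "O"], weights, feature_fns, {"word": "Wagner"})
best = max(probs, key=probs.get)
```

For the capitalized word only the first feature fires, and only for the class PER, so PER receives the highest probability while ORG and O tie.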

Estimating the weight

Gradient Iterative Scaling

Features

Summary so far

Naïve Bayes Classifier
Logistic Regression Classifier
  • Sometimes called MaxEnt classifiers

How do we apply classification to sequences?

Sequence Labeling as Classification

Classify each token independently but use as input features, information about the surrounding tokens (sliding window).

Slide from Ray Mooney

John saw the saw and decided to take it to the table.

The classifier is applied to each token in turn:

John/NNP saw/VBD the/DT saw/NN and/CC decided/VBD to/TO take/VB it/PRP to/IN the/DT table/NN
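
The sliding-window idea can be sketched as a feature extractor that looks at the neighbouring words of each token. The function name, the feature keys, and the `<s>` / `</s>` padding symbols are assumptions made for illustration.

```python
def window_features(tokens, i, size=1):
    """Features for token i drawn from `size` tokens on each side."""
    feats = {"w0": tokens[i].lower()}
    for k in range(1, size + 1):
        # Pad with boundary symbols when the window runs off the sentence.
        feats[f"w-{k}"] = tokens[i - k].lower() if i - k >= 0 else "<s>"
        feats[f"w+{k}"] = tokens[i + k].lower() if i + k < len(tokens) else "</s>"
    return feats


tokens = "John saw the saw and decided to take it to the table .".split()
first_saw = window_features(tokens, 1)    # the verb reading
second_saw = window_features(tokens, 3)   # the noun reading
```

The two occurrences of "saw" share the same word feature but differ in their neighbours, which is the information an independent per-token classifier needs to tell them apart.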


Using Outputs as Inputs

Better input features are usually the categories of the surrounding tokens, but these are not available yet
Can use category of either the preceding or succeeding tokens by going forward or back and using previous output


Forward Classification

John saw the saw and decided to take it to the table.

  tags assigned so far        token     classifier output
  (none)                      John      NNP
  NNP                         saw       VBD
  NNP VBD                     the       DT
  NNP VBD DT                  saw       NN
  NNP VBD DT NN               and       CC
  NNP VBD DT NN CC            decided   VBD
  NNP VBD DT NN CC VBD        to        TO
  NNP VBD DT NN CC VBD TO     take      VB
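
A minimal sketch of forward classification, in which each decision receives the tag just predicted for the token to its left. `toy_classify` is a hand-written stand-in for a trained classifier and exists only to make the example runnable.

```python
def tag_forward(tokens, classify):
    """Tag left to right, feeding each predicted tag into the next decision."""
    tags = []
    for word in tokens:
        prev_tag = tags[-1] if tags else "<s>"   # "<s>" marks sentence start
        tags.append(classify(word, prev_tag))
    return tags


def toy_classify(word, prev_tag):
    # Hand-written rules, purely illustrative (not a learned model).
    if word.lower() == "saw":
        # After a determiner, "saw" is read as a noun; otherwise as a verb.
        return "NN" if prev_tag == "DT" else "VBD"
    lexicon = {"john": "NNP", "the": "DT"}
    return lexicon.get(word.lower(), "NN")


tags = tag_forward(["John", "saw", "the", "saw"], toy_classify)
```

Note that each tag is committed as soon as it is produced, so an early mistake is passed on to every later decision.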


Backward Classification

Disambiguating “to” in this case would be even easier backward.

John saw the saw and decided to take it to the table.

  tags assigned so far (to the right)      token     classifier output
  DT NN                                    to        IN
  IN DT NN                                 it        PRP
  PRP IN DT NN                             take      VB
  VB PRP IN DT NN                          to        TO
  TO VB PRP IN DT NN                       decided   VBD
  VBD TO VB PRP IN DT NN                   and       CC
  CC VBD TO VB PRP IN DT NN                saw       VBD
  VBD CC VBD TO VB PRP IN DT NN            the       DT
  DT VBD CC VBD TO VB PRP IN DT NN         saw       VBD
  VBD DT VBD CC VBD TO VB PRP IN DT NN     John      NNP


NER as Sequence Labeling

Why classifiers aren’t as good as sequence models

Problems with using Classifiers for Sequence Labeling

It’s not easy to integrate information from hidden labels on both sides
We make a hard decision on each token
  • We’d rather choose a global optimum
  • The best labeling for the whole sequence
  • Keeping each local decision as just a probability, not a hard decision

Probabilistic Sequence Models

Probabilistic sequence models allow integrating uncertainty over multiple, interdependent classifications and collectively determine the most likely global assignment
Two standard models
  • Hidden Markov Model (HMM)
  • Conditional Random Field (CRF)
    Maximum Entropy Markov Model (MEMM) is a simplified version of CRF

HMMs vs. MEMMs

[Figure: HMM (top) and MEMM (bottom)]

Viterbi in MEMMs

We condition on the observation AND the previous state:

HMM decoding:

Which is the HMM version of:

MEMM decoding:

Decoding in MEMMs
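
The decoding equations were lost in extraction. In the usual textbook form (assumed here), the HMM Viterbi step is v_t(j) = max_i v_{t-1}(i) · P(s_j | s_i) · P(o_t | s_j), while the MEMM step replaces the two factors with a single conditional: v_t(j) = max_i v_{t-1}(i) · P(s_j | s_i, o_t). The sketch below implements the MEMM version; `toy_prob` is a made-up table standing in for a trained MaxEnt model.

```python
def memm_viterbi(observations, states, prob):
    """prob(state, prev_state, obs) -> P(state | prev_state, obs).

    prev_state is None for the first observation.
    """
    # v[t][s]: probability of the best tag path ending in state s at time t
    v = [{s: prob(s, None, observations[0]) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        v.append({})
        back.append({})
        for s in states:
            best_prev = max(
                states, key=lambda r: v[t - 1][r] * prob(s, r, observations[t])
            )
            v[t][s] = v[t - 1][best_prev] * prob(s, best_prev, observations[t])
            back[t][s] = best_prev
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: v[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


def toy_prob(state, prev, obs):
    # Made-up distributions, conditioned on capitalization and previous tag.
    if obs[0].isupper():
        if prev in ("B_PER", "I_PER"):
            dist = {"B_PER": 0.1, "I_PER": 0.8, "O": 0.1}
        else:
            dist = {"B_PER": 0.7, "I_PER": 0.0, "O": 0.3}
    else:
        dist = {"B_PER": 0.05, "I_PER": 0.05, "O": 0.9}
    return dist[state]


path = memm_viterbi(
    ["spokesman", "Tim", "Wagner", "said"], ["B_PER", "I_PER", "O"], toy_prob
)
```

Unlike forward classification, no tag is committed until the end: every state keeps its best-path score, and the final labeling is the global optimum under the model. A practical implementation would work with log probabilities to avoid underflow on long sentences.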

Evaluation Metrics

Precision

Precision: how many of the names we returned are really names?
Recall: how many of the names in the database did we find?

Precision = (Number of correct names given by system) / (Total number of names given by system)

Recall = (Number of correct names given by system) / (Total number of actual names in the text)
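
The two ratios can be sketched directly, assuming entities are compared as exact `(start, end, label)` spans. That matching criterion and the example spans are assumptions for illustration; other scoring schemes give partial credit.

```python
def precision_recall(predicted, gold):
    """predicted, gold: sets of (start, end, label) entity spans."""
    correct = len(predicted & gold)     # spans matching in extent and label
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


gold = {(0, 2, "ORG"), (11, 13, "PER"), (20, 21, "LOC")}
predicted = {(0, 2, "ORG"), (11, 13, "ORG")}   # second span has the wrong label
p, r = precision_recall(predicted, gold)
```

Here one of the two returned names is correct and one of the three actual names is found, so precision is one half and recall is one third.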

F-measure

F-measure is a way to combine these:

More generally:

F-measure

Harmonic mean is the reciprocal of arithmetic mean of reciprocals:

Hence F-measure is:
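
The formulas themselves did not survive extraction. Under the standard definitions (assumed here), with precision $P$ and recall $R$, the harmonic-mean construction gives:

```latex
F \;=\; \frac{1}{\tfrac{1}{2}\left(\tfrac{1}{P} + \tfrac{1}{R}\right)}
  \;=\; \frac{2PR}{P + R},
\qquad
F_\beta \;=\; \frac{(\beta^2 + 1)\,PR}{\beta^2 P + R}
```

Setting $\beta = 1$ in the general form recovers the balanced F-measure on the left; values of $\beta$ above 1 weight recall more heavily and values below 1 weight precision more heavily.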

