Named Entity Tagging
Thanks to Dan Jurafsky, Jim Martin, Ray Mooney, Tom Mitchell for slides
Outline
Named Entities and the basic idea
IOB Tagging
A new classifier: Logistic Regression
• Linear regression
• Logistic regression
• Multinomial logistic regression = MaxEnt
Why classifiers aren’t as good as sequence models
A new sequence model: MEMM = Maximum Entropy Markov Model
Named Entity Tagging
Slide from Jim Martin
CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.
Named Entity Tagging
[CHICAGO] (AP) — Citing high fuel prices, [United Airlines] said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. [American Airlines], a unit of [AMR], immediately matched the move, spokesman [Tim Wagner] said. [United], a unit of [UAL], said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as [Chicago] to [Dallas] and [Atlanta] and [Denver] to [San Francisco], [Los Angeles] and [New York].
Slide from Jim Martin
Named Entity Recognition
Find the named entities and classify them by type
Typical approach
• Acquire training data
• Encode using IOB labeling
• Train a sequential supervised classifier
• Augment with pre- and post-processing using available list resources (census data, gazetteers, etc.)
Slide from Jim Martin
Temporal and Numerical Expressions
Temporals
• Find all the temporal expressions
• Normalize them based on some reference point
Numerical Expressions
• Find all the expressions
• Classify by type
• Normalize
Slide from Jim Martin
NE Types
Slide from Jim Martin
NE Types: Examples
Slide from Jim Martin
Ambiguity
Slide from Jim Martin
Biomedical Entities
• Disease
• Symptom
• Drug
• Body Part
• Treatment
• Enzyme
• Protein
Difficulty: discontiguous or overlapping mentions
Abdomen is soft, nontender, nondistended, negative bruits
NER Approaches
As with partial parsing and chunking there are two basic approaches (and hybrids)
Rule-based (regular expressions)
• Lists of names
• Patterns to match things that look like names
• Patterns to match the environments that classes of names tend to occur in
ML-based approaches
• Get annotated training data
• Extract features
• Train systems to replicate the annotation
Slide from Jim Martin
ML Approach
Slide from Jim Martin
Encoding for Sequence Labeling
We can use IOB encoding:
… United  Airlines  said  Friday  it  has  increased
  B_ORG   I_ORG     O     O       O   O    O
the  move  ,  spokesman  Tim    Wagner  said.
O    O     O  O          B_PER  I_PER   O
How many tags?
For N classes we have 2*N+1 tags
• An I and B for each class and one O for no-class
Each token in a text gets a tag
Can use simpler IO tagging if what?
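The IOB encoding above can be sketched in a few lines of Python. This is a minimal illustration, assuming entity spans are given as hypothetical (start, end, label) token offsets with an exclusive end; the function names are made up for this example.

```python
def iob_encode(tokens, spans):
    """Return one IOB tag per token.

    spans: list of (start, end, label) token offsets, end exclusive,
    assumed not to overlap (a hypothetical input format).
    """
    tags = ["O"] * len(tokens)            # default: outside any entity
    for start, end, label in spans:
        tags[start] = "B_" + label        # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I_" + label        # remaining tokens of the entity
    return tags


def tag_set(labels):
    """All tags for N classes: a B and an I per class, plus one O."""
    tags = ["O"]
    for label in labels:
        tags += ["B_" + label, "I_" + label]
    return tags


tokens = "United Airlines said Friday it has increased".split()
tags = iob_encode(tokens, [(0, 2, "ORG")])
for token, tag in zip(tokens, tags):
    print(token, tag)
print(len(tag_set(["ORG", "PER", "LOC"])))  # 2*N + 1 with N = 3
```

With three classes the tag set has seven members, which matches the 2*N+1 count on the slide.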
NER Features
Slide from Jim Martin
How to do NE tagging?
Classifiers
• Naïve Bayes
• Logistic Regression
Sequence Models
• HMMs
• MEMMs
• CRFs
Sequence models work better
Linear Regression
Example from Freakonomics (Levitt and Dubner 2005)
Fantastic/cute/charming versus granite/maple
Can we predict price from # of adjs?
Linear Regression
Multiple Linear Regression
Predicting values:
In general:
Let’s pretend an extra “intercept” feature f0 with value 1
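Written out, the general form with N features, and the same sum once the intercept feature f0 = 1 is added, can be sketched as follows. The symbol names (weights w_i, feature values f_i) are assumed notation, not taken from the slide.

```latex
\begin{align*}
y &= w_0 + \sum_{i=1}^{N} w_i f_i \\
y &= \sum_{i=0}^{N} w_i f_i = w \cdot f \qquad \text{with } f_0 = 1
\end{align*}
```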
Multiple Linear Regression
Learning in Linear Regression
Consider one instance x_j
We’d like to choose weights to minimize the difference between predicted and observed value for x_j:
This is an optimization problem that turns out to have a closed-form solution
Put the feature vectors f^(i) of the training observations into a matrix X
Put the observed values in a vector y
Formula that minimizes the cost:
W = (X^T X)^{-1} X^T y
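For a single feature plus the intercept feature f0 = 1, the closed-form solution can be worked out by hand, since X^T X is only a 2x2 matrix. The sketch below assumes a sum-of-squared-errors cost and uses made-up numbers (they are not from the Freakonomics example); the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares weights (w0, w1) for y = w0 * 1 + w1 * x.

    Evaluates W = (X^T X)^{-1} X^T y explicitly, where each row of X
    is [1, x]. X^T X = [[n, sx], [sx, sxx]] and X^T y = [sy, sxy].
    """
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx               # determinant of X^T X
    if det == 0:
        raise ValueError("X^T X is singular: all x values are identical")
    w0 = (sxx * sy - sx * sxy) / det
    w1 = (n * sxy - sx * sy) / det
    return w0, w1


def predict(w0, w1, x):
    """Predicted value for feature value x."""
    return w0 + w1 * x


# Made-up data: number of vague adjectives vs. price (arbitrary units).
num_adjectives = [0, 1, 2, 3]
price = [100.0, 95.0, 90.0, 85.0]
w0, w1 = fit_line(num_adjectives, price)
print(w0, w1, predict(w0, w1, 4))
```

With more features the same formula applies, but the inverse is usually computed with a linear algebra library rather than by hand.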
Logistic Regression
But in these language problems we are doing classification
Predicting one of a small set of discrete values
Could we just use linear regression for this?
Logistic regression
Not possible: the output of a linear model doesn’t fall between 0 and 1
Instead of predicting the probability, predict the ratio of probabilities (the odds):
but still not good: the odds lie between 0 and ∞, while a linear function ranges from −∞ to ∞
So how about if we predict the log of the odds:
Logistic regression
Solving this for p(y=true)
Logistic function
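The steps on these logistic regression slides can be written out as a short derivation. This is a sketch in assumed notation (w · f for the weighted feature sum, as in linear regression), not a copy of the original formulas.

```latex
\begin{align*}
% First attempt: model the odds with a linear function (ranges do not match)
\frac{p(y=\text{true}\mid x)}{1-p(y=\text{true}\mid x)} &= w\cdot f \\
% Second attempt: model the log odds, which range over all real values
\ln\frac{p(y=\text{true}\mid x)}{1-p(y=\text{true}\mid x)} &= w\cdot f \\
% Solving for the probability gives the logistic function
p(y=\text{true}\mid x) &= \frac{e^{w\cdot f}}{1+e^{w\cdot f}}
                        = \frac{1}{1+e^{-w\cdot f}}
\end{align*}
```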
Logistic Regression
How do we do classification?
Or:
Or back to explicit sum notation:
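One way to read the classification rule: choose "true" when p(y=true|x) > p(y=false|x), which for the logistic function is the same as asking whether the weighted feature sum is positive. The sketch below assumes that reading; the weights and feature values are made up for illustration.

```python
import math


def dot(weights, features):
    """Explicit sum notation: sum over i of w_i * f_i."""
    return sum(w * f for w, f in zip(weights, features))


def p_true(weights, features):
    """Logistic function applied to the weighted feature sum."""
    return 1.0 / (1.0 + math.exp(-dot(weights, features)))


def classify(weights, features):
    """True when p(y=true|x) > 0.5, which is equivalent to w . f > 0."""
    return dot(weights, features) > 0


# Hypothetical weights; the first feature is the intercept f0 = 1.
weights = [-1.0, 2.0]
for features in ([1.0, 1.0], [1.0, 0.0]):
    print(features, round(p_true(weights, features), 3), classify(weights, features))
```

Both tests agree on every input because the logistic function crosses 0.5 exactly where its argument is 0.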
Multinomial logistic regression
Multiple classes:
One change: indicator functions f(c,x) instead of real values
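A minimal sketch of the multi-class case, assuming the usual normalized-exponential form p(c|x) = exp(Σ w_i f_i(c,x)) / Σ_c' exp(Σ w_i f_i(c',x)). The indicator features, weights, and class names below are invented for illustration.

```python
import math

CLASSES = ["PER", "ORG", "O"]


def indicator_features(c, x):
    """Indicator functions f_i(c, x): each is 1 only for one class/observation pattern."""
    return [
        1 if c == "PER" and x["capitalized"] else 0,
        1 if c == "PER" and x["prev_word"] == "spokesman" else 0,
        1 if c == "ORG" and x["capitalized"] else 0,
        1 if c == "O" and not x["capitalized"] else 0,
    ]


# Hypothetical weights, one per indicator feature.
WEIGHTS = [0.5, 2.0, 0.4, 1.5]


def class_probs(x):
    """Probability of each class given observation x."""
    scores = {
        c: sum(w * f for w, f in zip(WEIGHTS, indicator_features(c, x)))
        for c in CLASSES
    }
    top = max(scores.values())            # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())                # normalizer over all classes
    return {c: e / z for c, e in exps.items()}


x = {"capitalized": True, "prev_word": "spokesman"}
probs = class_probs(x)
best = max(probs, key=probs.get)
print(best, probs)
```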
Estimating the weight
Gradient Iterative Scaling
Features
Summary so far
Naïve Bayes Classifier
Logistic Regression Classifier
• Sometimes called MaxEnt classifiers
How do we apply classification to sequences?
Sequence Labeling as Classification
Classify each token independently, but use information about the surrounding tokens as input features (sliding window).
The classifier is applied to each token in turn, left to right, and outputs one tag per token:
John  saw  the  saw  and  decided  to  take  it   to  the  table.
NNP   VBD  DT   NN   CC   VBD      TO  VB    PRP  IN  DT   NN
Slide from Ray Mooney
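A sliding window can be sketched as a feature extractor that looks at the token being classified and its neighbours. The feature names, the padding symbol, and the window size below are assumptions made for this example.

```python
def window_features(tokens, i, size=1):
    """Features for token i drawn from a window of +/- size tokens."""
    feats = {}
    for offset in range(-size, size + 1):
        j = i + offset
        # Positions outside the sentence get a padding symbol.
        word = tokens[j].lower() if 0 <= j < len(tokens) else "<PAD>"
        feats["word[%+d]" % offset] = word
    feats["capitalized"] = tokens[i][0].isupper()
    return feats


tokens = "John saw the saw and decided to take it to the table".split()
for i in (1, 3):                          # the two occurrences of "saw"
    print(window_features(tokens, i))
```

The two occurrences of "saw" get different feature sets (preceded by "John" versus "the"), which is what lets an independent classifier tell the verb from the noun.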
Using Outputs as Inputs
Better input features are usually the categories of the surrounding tokens, but these are not available yet
Can use category of either the preceding or succeeding tokens by going forward or back and using previous output
Slide from Ray Mooney
Forward Classification
Classify left to right; the tags already assigned to preceding tokens are available as features for the current token:
John  saw  the  saw  and  decided  to  take  it  to  the  table.
NNP   VBD  DT   NN   CC   VBD      TO  VB
Slide from Ray Mooney
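Forward classification can be sketched as a greedy loop that passes the previous output back in as a feature. The classifier here is a hand-written stand-in (a hypothetical lexicon plus two rules), not a trained model; it exists only to show the control flow.

```python
# Hypothetical lexicon for unambiguous words in the example sentence.
LEXICON = {
    "John": "NNP", "the": "DT", "and": "CC", "decided": "VBD",
    "take": "VB", "it": "PRP", "table": "NN",
}


def toy_classifier(word, prev_tag):
    """Stand-in classifier: uses the previous tag to resolve ambiguous words."""
    if word == "saw":
        return "NN" if prev_tag == "DT" else "VBD"
    if word == "to":
        return "TO" if prev_tag in ("VBD", "VB") else "IN"
    return LEXICON[word]


def forward_classify(tokens, classify):
    """Tag left to right, making a hard decision at each token."""
    tags = []
    for word in tokens:
        prev_tag = tags[-1] if tags else "<START>"
        tags.append(classify(word, prev_tag))
    return tags


tokens = "John saw the saw and decided to take it to the table".split()
tags = forward_classify(tokens, toy_classifier)
print(" ".join(tags))
```

Each decision is final once made, so an early mistake is carried forward as a feature for every later token.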
Backward Classification
Disambiguating “to” in this case would be even easier backward.
Classify right to left; the tags already assigned to following tokens are available as features for the current token. Tags assigned, shown in sentence order:
John  saw  the  saw  and  decided  to  take  it   to  the  table.
NNP   VBD  DT   VBD  CC   VBD      TO  VB    PRP  IN  DT   NN
Slide from Ray Mooney
NER as Sequence Labeling
Why classifiers aren’t as good as sequence models
Problems with using Classifiers for Sequence Labeling
It’s not easy to integrate information from hidden labels on both sides
We make a hard decision on each token
• We’d rather choose a global optimum
• The best labeling for the whole sequence
• Keeping each local decision as just a probability, not a hard decision
Probabilistic Sequence Models
Probabilistic sequence models allow integrating uncertainty over multiple, interdependent classifications and collectively determine the most likely global assignment
Two standard models
• Hidden Markov Model (HMM)
• Conditional Random Field (CRF)
  • Maximum Entropy Markov Model (MEMM) is a simplified version of CRF
HMMs vs. MEMMs
Slide from Jim Martin
HMMs vs. MEMMs
Slide from Jim Martin
HMMs vs. MEMMs
Slide from Jim Martin
HMM (top) and MEMM (bottom)
Viterbi in MEMMs
We condition on the observation AND the previous state:
HMM decoding:
Which is the HMM version of:
MEMM decoding:
Decoding in MEMMs
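MEMM decoding can be sketched with the Viterbi recurrence v_t(j) = max_i v_{t-1}(i) · P(s_j | s_i, o_t), where the HMM version would instead multiply a transition term P(s_j | s_i) by an emission term P(o_t | s_j). The code below is a sketch under that assumption; the local probability table is invented, where a real MEMM would use a trained MaxEnt classifier.

```python
import math


def viterbi_memm(observations, states, local_prob, start="<START>"):
    """Most probable state sequence under the product of P(s_t | s_{t-1}, o_t).

    local_prob(prev_state, state, obs) returns P(state | prev_state, obs).
    """
    if not observations:
        return []

    def logp(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s] = (log probability of the best path ending in s, backpointer)
    best = [{s: (logp(local_prob(start, s, observations[0])), None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                ((best[-1][r][0] + logp(local_prob(r, s, obs)), r) for r in states),
                key=lambda pair: pair[0],
            )
        best.append(column)

    # Follow backpointers from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


def toy_local_prob(prev, state, obs):
    """Invented distributions conditioned on capitalization and previous tag."""
    if not obs[0].isupper():
        return {"O": 0.9, "B_PER": 0.05, "I_PER": 0.05}[state]
    if prev in ("B_PER", "I_PER"):
        return {"O": 0.1, "B_PER": 0.1, "I_PER": 0.8}[state]
    return {"O": 0.2, "B_PER": 0.7, "I_PER": 0.1}[state]


words = "spokesman Tim Wagner said".split()
path = viterbi_memm(words, ["O", "B_PER", "I_PER"], toy_local_prob)
print(list(zip(words, path)))
```

Unlike greedy forward classification, each local decision stays a probability until the whole sequence has been scored.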
Evaluation Metrics
Precision
Precision: how many of the names we returned are really names?
Recall: how many of the names in the database did we find?
Precision = Number of correct names given by system / Total number of names given by system
Recall = Number of correct names given by system / Total number of actual names in the text
F-measure
F-measure is a way to combine these:
More generally:
F-measure
Harmonic mean is the reciprocal of arithmetic mean of reciprocals:
Hence F-measure is:
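These metrics can be sketched in code, assuming the standard forms F1 = 2PR / (P + R) and, more generally, F_β = (β² + 1)PR / (β²P + R). Entities are represented as hypothetical (start, end, type) tuples and count as correct only on an exact match; the gold and predicted sets below are made up.

```python
def precision_recall_f(gold, predicted, beta=1.0):
    """Precision, recall and F-measure over sets of entity tuples."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)       # exact span-and-type matches
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    denom = beta ** 2 * precision + recall
    f = (beta ** 2 + 1) * precision * recall / denom if denom else 0.0
    return precision, recall, f


# Made-up annotations: (start token, end token, entity type).
gold = {(0, 2, "ORG"), (10, 12, "PER"), (13, 14, "ORG")}
predicted = {(0, 2, "ORG"), (10, 12, "PER"), (5, 6, "LOC"), (20, 21, "LOC")}

p, r, f1 = precision_recall_f(gold, predicted)
print(p, r, f1)
```

With β = 1 the result is the harmonic mean of precision and recall; larger β weights recall more heavily.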
Outline
Named Entities and the basic idea
IOB Tagging
A new classifier: Logistic Regression
• Linear regression
• Logistic regression
• Multinomial logistic regression = MaxEnt
Why classifiers aren’t as good as sequence models
A new sequence model: MEMM = Maximum Entropy Markov Model