Unsupervised Models for Coreference Resolution


Unsupervised Models for Coreference Resolution

Vincent Ng
Human Language Technology Research Institute
University of Texas at Dallas


Plan for the Talk

Supervised learning for coreference resolution
- how and when supervised coreference research started
- standard machine learning approach

Unsupervised learning for coreference resolution
- self-training
- EM clustering (Ng, 2008)
- nonparametric Bayesian modeling (Haghighi and Klein, 2007)
  - three modifications


Machine Learning for Coreference Resolution
- started in mid-1990s
  - Connolly et al. (1994), Aone and Bennett (1995), McCarthy and Lehnert (1995)
- propelled by the availability of annotated corpora produced by
  - Message Understanding Conferences (MUC-6/7: 1995, 1998): English only
  - Automatic Content Extraction (ACE 2003, 2004, 2005, 2008): English, Chinese, Arabic
- identified as an important task for information extraction
- identity coreference only


Identity Coreference
Identify the noun phrases (or mentions) that refer to the same real-world entity:

  Queen Elizabeth set about transforming her husband, King George VI, into a
  viable monarch. Logue, a renowned speech therapist, was summoned to help the
  King overcome his speech impediment...

Lots of prior work on supervised coreference resolution


Standard Supervised Learning Approach: Classification
- a classifier is trained to determine whether two mentions are coreferent or not coreferent

  [Queen Elizabeth] set about transforming [her] [husband], ...

[Diagram: for each pair among the bracketed mentions, the classifier is asked: coref or not coref?]

Standard Supervised Learning Approach: Clustering
- coordinates possibly contradictory pairwise classification decisions

[Diagram: pairwise decisions over [Queen Elizabeth], [her], [husband], ... (coref, not coref,
not coref) are fed into a clustering algorithm, which outputs entity clusters such as
{Queen Elizabeth, her}, {husband, King George VI, the King, his}, and
{Logue, a renowned speech therapist}.]
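One minimal way to coordinate pairwise decisions (a sketch, not necessarily the algorithm used in the talk) is single-link clustering: take every pair the classifier labels coref and compute connected components, for example with union-find:

```python
def cluster_mentions(mentions, coref_pairs):
    """Single-link clustering: group mentions connected by 'coref' decisions.

    mentions: list of mention strings
    coref_pairs: iterable of (i, j) index pairs the classifier labeled coref
    """
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in coref_pairs:
        parent[find(i)] = find(j)  # union the two clusters

    clusters = {}
    for idx, m in enumerate(mentions):
        clusters.setdefault(find(idx), []).append(m)
    return list(clusters.values())

mentions = ["Queen Elizabeth", "her", "husband", "King George VI"]
# suppose the classifier says (0, 1) is coref and (2, 3) is coref
print(cluster_mentions(mentions, [(0, 1), (2, 3)]))
```

Note that single-link transitively merges clusters, so one confident "coref" link can override several "not coref" decisions; this is exactly the kind of contradiction the clustering step must resolve.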

Standard Supervised Learning Approach
- typically relies on a large amount of labeled data
- What if we only have a small amount of annotated data?


First Attempt: Supervised Learning
- train on whatever annotated data we have
- need to specify:
  - learning algorithm (Bayes)
  - feature set
  - clustering algorithm (Bell tree)

The Bayes Classifier
- finds the class value y (Coref or Not Coref) that is most probable given the feature vector x1, ..., xn
- finds y* such that

  y* = argmax_{y in Y} P(y | x1, x2, ..., xn)
     = argmax_{y in Y} P(y) P(x1, x2, ..., xn | y)

What features to use in the feature representation?

Linguistic Features
Use 7 linguistic features divided into 3 groups:

Strong Coreference Indicators
- String match
- Appositive
- Alias (one is an acronym or abbreviation of the other)

Linguistic Constraints
- Gender agreement
- Number agreement
- Semantic compatibility

Mention Pair Type
- (ti, tj), where ti, tj ∈ { Pronoun, Name, Nominal }
- E.g., for the mention pair (Barack Obama, president-elect), the feature value is (Name, Nominal)
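Features of these kinds can be sketched as simple string and attribute checks. The sketch below is only illustrative: real systems use parsers, gazetteers, and lexical resources for appositives, gender, and semantic compatibility, so the stubbed-out values and helper names here are my own:

```python
def mention_type(m):
    # crude heuristic: pronouns from a closed list, capitalized strings as names
    pronouns = {"he", "she", "it", "they", "his", "her", "him", "them", "its"}
    if m.lower() in pronouns:
        return "Pronoun"
    if m[0].isupper():
        return "Name"
    return "Nominal"

def pair_features(mi, mj):
    """Approximate the 7 features for a mention pair (illustrative only)."""
    return {
        "string_match": mi.lower() == mj.lower(),
        "appositive": False,          # needs syntax; stubbed out here
        "alias": mi.replace(".", "") == "".join(w[0] for w in mj.split()),
        "gender_agree": True,         # would come from a gender lexicon
        "number_agree": not (mi.endswith("s") ^ mj.endswith("s")),
        "semantic_compat": True,      # would come from, e.g., WordNet classes
        "mention_pair_type": (mention_type(mi), mention_type(mj)),
    }

print(pair_features("Barack Obama", "president-elect")["mention_pair_type"])
```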

The Bayes Classifier
- finds the class value y (COREF or NOT COREF) that is most probable given the feature vector x1, ..., x7
- finds y* such that

  y* = argmax_{y in Y} P(y | x1, x2, ..., x7)
     = argmax_{y in Y} P(y) P(x1, x2, ..., x7 | y)

- But we may have a data sparseness problem. Let's simplify this term!
- Assume that feature values from different groups are independent of each other given the class:

  y* = argmax_{y in Y} P(y) P(x1, x2, x3 | y) P(x4, x5, x6 | y) P(x7 | y)

- These are the model parameters (to be estimated from annotated data using maximum likelihood estimation)
- Generative model: specifies how an instance is generated
  - Generate the class y with P(y)
  - Given y, generate x1, x2, and x3 with P(x1, x2, x3 | y)
  - Given y, generate x4, x5, and x6 with P(x4, x5, x6 | y)
  - Given y, generate x7 with P(x7 | y)
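The grouped factorization can be sketched directly. This is a minimal illustration, not the talk's implementation: feature values are opaque tuples, the training pairs are hypothetical, and the smoothing scheme is my own simplification:

```python
from collections import Counter, defaultdict

GROUPS = [(0, 1, 2), (3, 4, 5), (6,)]  # indicator, constraint, pair-type groups

def train(pairs):
    """MLE counts for P(y) and each grouped P(x_group | y).

    pairs: list of (x, y) where x is a 7-tuple of feature values.
    """
    class_counts = Counter(y for _, y in pairs)
    group_counts = defaultdict(Counter)
    for x, y in pairs:
        for g in GROUPS:
            group_counts[y][(g, tuple(x[i] for i in g))] += 1
    return class_counts, group_counts

def classify(x, class_counts, group_counts, alpha=0.1):
    """Pick argmax_y P(y) * prod_g P(x_g | y), with light smoothing."""
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for y, cy in class_counts.items():
        score = cy / total  # P(y)
        for g in GROUPS:
            score *= (group_counts[y][(g, tuple(x[i] for i in g))] + alpha) / (cy + alpha)
        if score > best_score:
            best, best_score = y, score
    return best
```

Because each group is estimated jointly, the model keeps within-group correlations (e.g., between string match and alias) while still avoiding a full 7-dimensional table.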

First Attempt: Supervised Learning
- train on whatever annotated data we have
- need to specify: learning algorithm, feature set, clustering algorithm

Bell-Tree Clustering (Luo et al., 2004)
- searches for the most probable partition of a set of mentions
- structures the search space as a Bell tree:

  [1]
  ├─ [12]
  │   ├─ [123]
  │   └─ [12][3]
  └─ [1][2]
      ├─ [13][2]
      ├─ [1][23]
      └─ [1][2][3]

- Leaves contain all the possible partitions of all of the mentions
- It is computationally infeasible to expand all the nodes in the Bell tree, so the algorithm expands only the most promising nodes
- How to determine which nodes are promising?

Determining the Most Promising Paths
Idea: assign a score to each node, based on the pairwise probabilities returned by the coreference classifier.

Classifier gives us: Pc(1, 2) = 0.6, Pc(1, 3) = 0.2, Pc(2, 3) = 0.7

- the root [1] has score 1
- [12]: 1 * Pc(1,2) = 1 * 0.6 = 0.6
- [1][2]: 1 * (1 - Pc(1,2)) = 1 * (1 - 0.6) = 0.4
- [123]: 0.6 * max(Pc(1,3), Pc(2,3)) = 0.6 * max(0.2, 0.7) = 0.42
- [12][3]: 0.58
- [13][2]: 0.08
- [1][23]: 0.28
- [1][2][3]: 0.12

The algorithm expands only the N most probable nodes at each level.
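The search can be sketched as a beam search over partial partitions. This is an illustration, not Luo et al.'s exact scoring: merging a mention into a cluster multiplies by the best link probability (as in the worked [123] example), starting a new cluster multiplies by one minus the best available link (as in the [1][2] example), and `PC` holds the hypothetical pairwise probabilities from the slide:

```python
from itertools import chain

PC = {(1, 2): 0.6, (1, 3): 0.2, (2, 3): 0.7}  # pairwise coref probabilities

def pc(i, j):
    return PC[(min(i, j), max(i, j))]

def bell_tree_beam(mentions, beam_width=2):
    """Beam search over partitions: each step places the next mention either
    into an existing cluster or into a new singleton cluster."""
    beam = [([[mentions[0]]], 1.0)]  # (partition, score)
    for m in mentions[1:]:
        candidates = []
        for partition, score in beam:
            for k, cluster in enumerate(partition):
                # merge m into cluster k: score *= best link probability
                new_p = [c[:] for c in partition]
                new_p[k].append(m)
                candidates.append((new_p, score * max(pc(m, x) for x in cluster)))
            # start a new cluster: score *= (1 - best link to any old mention)
            best_link = max(pc(m, x) for x in chain.from_iterable(partition))
            candidates.append((partition + [[m]], score * (1 - best_link)))
        beam = sorted(candidates, key=lambda t: -t[1])[:beam_width]
    return beam[0]

print(bell_tree_beam([1, 2, 3]))
```

With the probabilities above, the best partition found is [123] with score 0.42, matching the worked example.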

Where are we?
We have described:
- a learning algorithm for training a coreference classifier
- a clustering algorithm for combining coreference probabilities

Goal: evaluate this coreference system in the presence of a small amount of labeled data

Experimental Setup

The ACE 2003 coreference corpus
- 3 data sets (Broadcast News, Newswire, Newspaper)
- each has a training set and a test set
- use one training text for training the coreference classifier
- evaluate on the entire test set

Mentions extracted automatically using an NP chunker

Scoring program
- CEAF scoring program (Luo, 2005)
- recall, precision, F-measure

Evaluation Results (Experiments on System Mentions)

                                   Broadcast News        Newswire
                                   R     P     F        R     P     F
  Weakly Supervised Baseline      53.1  45.5  49.0     57.2  50.3  53.5
  Heuristic Baseline              54.3  43.7  48.4     58.9  50.2  54.2
  Our EM-based Model              57.0  54.6  55.7     62.9  56.5  59.6
  Duplicated Haghighi and Klein   53.2  39.3  45.2     54.5  44.2  48.8
   + Relaxed Head Generation      53.4  42.8  47.5     55.9  49.8  52.6
   + Agreement Constraints        57.8  46.3  51.4     57.9  51.5  54.5
   + Pronoun-only Salience        59.2  50.8  54.7     59.4  55.6  57.4
  Fully Supervised Model          63.4  60.3  61.8     65.8  63.2  64.5

Can we improve performance by combining a small amount of labeled data and a potentially large amount of unlabeled data?

Plan for the Talk

Supervised learning for coreference resolution
- brief history
- standard machine learning approach

Unsupervised learning for coreference resolution
- self-training
- EM clustering (Ng, 2008)
- nonparametric Bayesian modeling (Haghighi and Klein, 2007)
  - three modifications

Self-Training

[Diagram: a small labeled data set (L) and a large unlabeled data set (U); a classifier h is trained on L, applied to U, and the N most confidently labeled instances are added to L before retraining.]
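The loop the diagram depicts can be sketched generically. This is a minimal version assuming a classifier object with scikit-learn-style `fit`/`predict_proba` methods; the function and parameter names are my own:

```python
import numpy as np

def self_train(model, X_lab, y_lab, X_unlab, n_per_iter=10, iterations=5):
    """Generic self-training: repeatedly move the most confidently labeled
    unlabeled instances into the labeled set and retrain.

    model: any classifier with fit(X, y) and predict_proba(X).
    """
    X_lab, y_lab = list(X_lab), list(y_lab)
    X_unlab = list(X_unlab)
    for _ in range(iterations):
        if not X_unlab:
            break
        model.fit(np.array(X_lab), np.array(y_lab))
        probs = model.predict_proba(np.array(X_unlab))
        conf = probs.max(axis=1)                # confidence of the best class
        picks = np.argsort(-conf)[:n_per_iter]  # the N most confident instances
        for i in sorted(picks, reverse=True):   # move them from U to L
            X_lab.append(X_unlab.pop(i))
            y_lab.append(int(probs[i].argmax()))
    model.fit(np.array(X_lab), np.array(y_lab))
    return model
```

Note how the loop only ever adds instances the current model is already sure about, which is exactly the weakness the following slides diagnose.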

Results (F-measure for Self-Training)

[Two plots: F-measure (y-axis, 43 to 55) against number of self-training iterations (x-axis, 0 to 9), without bagging, for Broadcast News and Newswire.]

Why doesn't Self-Training improve?
- only the most confidently labeled instances are added in each iteration
- the classifier already knows how to label these newly added instances
- not much new knowledge is gained by re-training a classifier from such newly added instances

Why does Self-Training hurt?
- also due to the bias towards confidently labeled instances
- many confidently labeled instances are pairs of identical proper names:

  (India, India)  (prince, prince)  (IBM, IBM)  (Clinton, Clinton): all labeled Coref
  Mention Pair Type feature value: (Name, Name) for every pair

- the classifier gradually learns that two proper names are likely to be coreferent, regardless of whether the names are identical

Why does Self-Training hurt?
Since we hypothesize that the Mention Pair Type feature is causing the problem, repeat the experiments without using this feature.

Results (F-measure for Self-Training)

[Two plots: F-measure (y-axis, 43 to 55) against number of self-training iterations (x-axis, 0 to 9) for Broadcast News and Newswire, with and without the Mention Pair Type feature.]

Some Lessons Learned
- when labeled data is scarce, feature design becomes an important issue
- when exploiting unlabeled data, it is crucial to learn from both confidently labeled and not-so-confidently labeled data

Plan for the Talk

Supervised learning for coreference resolution
- brief history
- standard machine learning approach

Unsupervised learning for coreference resolution
- self-training
- EM clustering (Ng, 2008)
- nonparametric Bayesian modeling (Haghighi and Klein, 2007)
  - three modifications

Unsupervised Coreference as EM Clustering
- exploits unlabeled data by inducing a clustering for an unlabeled document, not by labeling mention pairs
- the EM-based model is forced to learn from all of the mention pairs when the model is retrained

Representing a Clustering
A clustering C of n mentions is an n x n Boolean matrix, where Cij = 1 iff mentions i and j are coreferent.

[Diagram: a 5 x 5 matrix with coreferent entries set to 1 and non-coreferent entries set to 0.]

- Don't care about diagonal entries (a mention is trivially coreferent with itself)
- Don't care about entries below the diagonal (the matrix is symmetric)
- Coreference is transitive, so not every Boolean matrix is a valid clustering: if Cij = 1 and Cjk = 1, then Cik must also be 1
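The validity condition can be checked directly; a small sketch (the index convention and helper names are my own):

```python
def is_valid_clustering(C):
    """Check that an upper-triangular Boolean matrix respects transitivity:
    if i~j and j~k are coreferent, then i~k must be too."""
    n = len(C)

    def link(a, b):
        a, b = min(a, b), max(a, b)
        return C[a][b] == 1

    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                links = [link(i, j), link(j, k), link(i, k)]
                if sum(links) == 2:  # exactly two links violates transitivity
                    return False
    return True

valid = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]    # mentions 1, 2, 3 all coreferent
invalid = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]  # 1~2 and 2~3, but not 1~3
print(is_valid_clustering(valid), is_valid_clustering(invalid))
```

A triple of mentions violates transitivity exactly when two of its three links hold but the third does not; zero, one, or three links are all consistent.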

The Generative Model
Given a document D,
- generate a clustering C according to P(C)
- generate D given C

  P(D, C) = P(C) P(D | C)

How to generate D given C?
- Assume that D is represented by its mention pairs
- To generate D, generate all pairs of mentions in D:
  (Queen Elizabeth, her), (Queen Elizabeth, husband), (Queen Elizabeth, King George VI), ...

  P(D, C) = P(C) P(mp12, mp13, mp14, ... | C)

  where mpij is the pair formed from mention i and mention j

Let's simplify this term: assume that each mention pair mpij is generated conditionally independently given Cij:

  P(D, C) = P(C) ∏_{(i,j) ∈ Pairs(D)} P(mpij | Cij)

How to represent a mention pair mpij? Each mpij is represented by its 7 feature values (the same 7 linguistic features, in 3 groups, as before):

  P(D, C) = P(C) ∏_{(i,j) ∈ Pairs(D)} P(mpij,1, mpij,2, ..., mpij,7 | Cij)

Let's simplify this term: assume that feature values from different groups are conditionally independent of each other:

  P(D, C) = P(C) ∏_{(i,j) ∈ Pairs(D)} P(mpij,1, mpij,2, mpij,3 | Cij) P(mpij,4, mpij,5, mpij,6 | Cij) P(mpij,7 | Cij)

Model Parameters

  P(mp1, mp2, mp3 | c)
  P(mp4, mp5, mp6 | c)
  P(mp7 | c)

- the mpi are the feature values
- c ∈ { Coref, Not Coref }

If we had labeled data, we could estimate the parameters. But we don't have labeled data. So ...

Model Parameters
Use EM to iteratively
  estimate the model parameters
  probabilistically induce a clustering for a document

120

The Induction Algorithm
Given a set of unlabeled documents,
  guess a clustering for each document according to P(C)
    (the initial labelings are presumably noisy)
  estimate the model parameters based on the automatically labeled documents (M-step): maximum likelihood estimation
  assign a probability to each possible clustering of the mentions for each document (E-step)
  iterate till convergence

E-step example with 3 mentions 1, 2, 3:

  [123]  [12][3]  [13][2]  [1][23]  [1][2][3]  + invalid clusterings
  0.23   0.21     0.11     0.29     0.05       …

How to cope with the computational complexity of the E-step?
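The exact E-step above can be sketched by enumerating every partition of the mentions and normalizing their scores; the uniform scoring function below is a placeholder for the model's P(D, C).

```python
def partitions(items):
    """Yield every partition (clustering) of a list of mentions."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # put `first` into each existing cluster ...
        for i, cluster in enumerate(smaller):
            yield smaller[:i] + [[first] + cluster] + smaller[i + 1:]
        # ... or into a new singleton cluster
        yield [[first]] + smaller

def e_step(mentions, score):
    """Brute-force E-step: score every clustering and normalize."""
    cs = list(partitions(mentions))
    raw = [score(c) for c in cs]
    z = sum(raw)
    return [(c, r / z) for c, r in zip(cs, raw)]

clusterings = list(partitions([1, 2, 3]))
# Bell number B(3) = 5: [123], [12][3], [13][2], [1][23], [1][2][3]
```

The number of partitions grows as the Bell numbers, which is exactly why the talk turns to approximating the E-step next.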

130

Approximating the E-step
Search for the N most probable clusterings only, using the Bell Tree algorithm

[Bell tree over 3 mentions: [1] → { [12], [1][2] } → { [123], [12][3], [13][2], [1][23], [1][2][3] }]

The Induction Algorithm (revised)
Given a set of unlabeled documents,
  guess a clustering for each document according to P(C)
  estimate the model parameters based on the automatically labeled documents (M-step): maximum likelihood estimation
  assign a probability to each possible clustering of the mentions of each document (E-step): use the normalized scores of the 50-best clusterings
  iterate till convergence

133

Plan for the Talk
Supervised learning for coreference resolution
  brief history
  standard machine learning approach
Unsupervised learning for coreference resolution
  self-training
  EM clustering (Ng, 2008)
  nonparametric Bayesian modeling (Haghighi and Klein, 2007)
    three modifications

134

Haghighi and Klein’s Model
Cluster-level model
  assigns a cluster id to each mention
  ensures transitivity automatically

[Queen Elizabeth]1 set about transforming [her]1 [husband]2,
[King George VI]2, into [a viable monarch]3. [Logue]4, [a
renowned speech therapist]4, was summoned to help [the
King]2 overcome [his]2 [speech impediment]5...

141

Haghighi and Klein’s Generative Story
For each mention encountered in a document,
  generate a cluster id for the mention (according to some cluster id distribution)
  generate the head noun of the mention (according to some cluster-specific head distribution)

Inference: Gibbs sampling

Problem with the model: too simplistic!
  mentions with the same head are likely to get the same cluster id
  two occurrences of “she” will likely be posited as coreferent
  particularly inappropriate for generating pronouns

Extensions:
  use a separate “pronoun head model” to generate pronouns
  incorporate salience
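The cluster-id-then-head story can be sketched as below. The Chinese-restaurant-process-style prior and add-one head smoothing are simplifying assumptions for illustration, not H&K's exact distributions; the example also shows the "same head, same cluster" tendency the slide criticizes.

```python
from collections import Counter

def assignment_scores(head, cluster_heads, alpha=1.0, vocab_size=1000):
    """Score each existing cluster (plus a brand-new one) for a mention with
    the given head: P(cluster id) * P(head | cluster). The CRP-style prior
    and smoothed head model are simplifying assumptions."""
    n = sum(sum(h.values()) for h in cluster_heads)
    scores = []
    for heads in cluster_heads:
        prior = sum(heads.values()) / (n + alpha)     # bigger clusters preferred
        likelihood = (heads[head] + 1.0) / (sum(heads.values()) + vocab_size)
        scores.append(prior * likelihood)
    # starting a new cluster
    scores.append((alpha / (n + alpha)) * (1.0 / vocab_size))
    return scores

# Cluster 0 has generated "Elizabeth" twice; cluster 1 generated "Logue" once.
clusters = [Counter({"Elizabeth": 2}), Counter({"Logue": 1})]
s = assignment_scores("Elizabeth", clusters)
```

A new mention headed "Elizabeth" is pulled strongly into cluster 0, which is exactly why two occurrences of the same pronoun head would also be posited as coreferent.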

147

Plan for the Talk
Supervised learning for coreference resolution
  brief history
  standard machine learning approach
Unsupervised learning for coreference resolution
  self-training and its variant (Ng and Cardie, 2003)
  EM clustering (Ng, 2008)
  nonparametric Bayesian modeling (Haghighi and Klein, 2007)
    three modifications: relaxed head generation, agreement constraints, pronoun-only salience

148

Modification 1: Relaxed Head Generation
Motivation
  H&K’s model is linguistically impoverished
  does not exploit useful knowledge: alias, appositives, …
Goal
  a simple method for incorporating such knowledge sources

150

Modification 1: Relaxed Head Generation
Pre-process a document by assigning a “head id” to each mention, such that two mentions have the same head id iff
  they are the same string,
  or they are aliases,
  or they are in an appositive relation

  International Business Corporation → head id 1
  IBM → head id 1
  Barcelona → head id 2

Instead of generating the head noun, generate the head id
  the model views “International Business Corporation” and “IBM” as two mentions having the same head
  this encourages the model to put the two into the same cluster
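The head-id pre-processing step amounts to computing connected components over the three relations; a minimal union-find sketch, with stub predicates standing in for the real string-match, alias, and apposition detectors:

```python
def assign_head_ids(mentions, same_string, alias, appositive):
    """Group mentions with union-find; mentions linked by string match,
    alias, or apposition share a head id. The relation predicates are
    assumed to be supplied by pre-processing."""
    parent = list(range(len(mentions)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i
    def union(i, j):
        parent[find(i)] = find(j)
    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            a, b = mentions[i], mentions[j]
            if same_string(a, b) or alias(a, b) or appositive(a, b):
                union(i, j)
    ids, head_id = {}, []
    for i in range(len(mentions)):
        root = find(i)
        ids.setdefault(root, len(ids) + 1)
        head_id.append(ids[root])
    return head_id

mentions = ["International Business Corporation", "IBM", "Barcelona"]
same = lambda a, b: a == b
# stub alias predicate hard-wired to the slide's example pair
alias_pairs = {frozenset({"International Business Corporation", "IBM"})}
alias = lambda a, b: frozenset({a, b}) in alias_pairs
appos = lambda a, b: False
ids = assign_head_ids(mentions, same, alias, appos)
```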

155

Modification 2: Agreement Constraints
Motivation
  gender and number agreement is implemented as a preference, not as a constraint, in H&K’s model
  while the model favours the assignment of a pronoun to a gender- and number-compatible cluster,
  it also favours the assignment of a pronoun to a large cluster
  if a cluster is large enough, the model may assign the pronoun to the cluster even if the two are not compatible
Goal
  implement gender and number agreement as a constraint:
  disallow the generation of a mention by any cluster where the two are incompatible in number or gender
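Turning the preference into a hard constraint can be sketched as zeroing the score of incompatible clusters before assignment; the mention representation and size-based score here are illustrative assumptions.

```python
def masked_scores(pronoun, clusters, score):
    """Zero out the score of any cluster that disagrees with the pronoun in
    gender or number, so the pronoun can never be generated by an
    incompatible cluster (the hard-constraint version of agreement)."""
    out = []
    for cluster in clusters:
        compatible = all(
            m["gender"] in (pronoun["gender"], "unknown") and
            m["number"] in (pronoun["number"], "unknown")
            for m in cluster)
        out.append(score(cluster) if compatible else 0.0)
    return out

she = {"gender": "fem", "number": "sg"}
clusters = [
    [{"gender": "fem", "number": "sg"}],        # small but compatible
    [{"gender": "masc", "number": "sg"}] * 5,   # large but masculine
]
# toy score: favour large clusters, as H&K's model does
scores = masked_scores(she, clusters, lambda c: float(len(c)))
```

Without the mask, the size preference would pull “she” into the large masculine cluster; with it, that assignment is impossible.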

160

Modification 3: Pronoun-Only Salience
In H&K’s model, salience is applied to all types of mentions (pronouns, names, and nominals) during cluster assignment

Our hypothesis
  since names and nominals are less sensitive to salience, the net benefit of applying salience to names and nominals could be negative, as a result of inaccurate modeling of salience

We restrict the application of salience to pronouns only

161

Improving Haghighi and Klein’s Model
3 modifications
  relaxed head generation
  agreement constraints
  pronoun-only salience

162

Evaluation
  EM-based model
  Haghighi and Klein’s model, with and without the 3 modifications

163

Experimental Setup
The ACE 2003 coreference corpus
  3 data sets (Broadcast News, Newswire, Newspaper)
For each data set
  use one training text for initializing model parameters
  evaluate on the entire test set
Mentions extracted automatically using an NP chunker
Scoring program
  CEAF scoring program (Luo, 2005)

164

Results (Weakly Supervised Baseline)

Experiments on System Mentions (CEAF):

                                 Broadcast News       Newswire
                                 R     P     F        R     P     F
  Weakly Supervised Baseline     53.1  45.5  49.0     57.2  50.3  53.5
  Heuristic Baseline             54.3  43.7  48.4     58.9  50.2  54.2
  Our EM-based Model             57.0  54.6  55.7     62.9  56.5  59.6
  Duplicated Haghighi and Klein  53.2  39.3  45.2     54.5  44.2  48.8
  + Relaxed Head Generation      53.4  42.8  47.5     55.9  49.8  52.6
  + Agreement Constraints        57.8  46.3  51.4     57.9  51.5  54.5
  + Pronoun-only Salience        59.2  50.8  54.7     59.4  55.6  57.4
  Fully Supervised Model         63.4  60.3  61.8     65.8  63.2  64.5

Weakly Supervised Baseline:
  train the Bayes classifier on one (labeled) document
  use the Bell Tree clustering algorithm to impose a partition for each test document using the pairwise probabilities

165

Heuristic Baseline
Simple rule-based system
Posits two mentions as coreferent if and only if they are
  the same string,
  aliases,
  or in an appositive relation

166

Results (Heuristic Baseline)
(results table as above)

167

EM-Based Model
Initialize the parameters using one (labeled) document, rather than using randomly guessed clusterings

168

Results (EM-Based Model)
(results table as above)
  gains in both recall and precision
  F-measure increases by 5-7%

170

Duplicated Haghighi and Klein’s Model
Use the same labeled document as in the EM-based model to learn the value of α in the Dirichlet Process

171

Results (Duplicated H&K’s Model)
(results table as above)
In comparison to the EM-based model:
  precision drops substantially
  F-measure decreases by 10-11%

173

Results (Adding 3 Modifications)
(results table as above)
In comparison to Duplicated Haghighi and Klein:
  F-measure improves after the addition of each modification
  modest gain in recall and substantial gain in precision when all modifications are applied (9-10% gain in F-measure)

176

Results (Fully-Supervised Resolver)
(results table as above)
  trained using C4.5, the entire ACE training set, and 34 features
  outperforms the unsupervised models by 7%

177

Using a Knowledge-Based Feature
Add a feature to the EM-based model that encodes the output of a knowledge-based coreference system
  implements heuristics used by different MUC-7 resolvers
The resulting model is not so “unsupervised”

178

Results (EM-Based Model w/ KB Feature)

Experiments on System Mentions (CEAF):

                                   Broadcast News       Newswire
                                   R     P     F        R     P     F
  EM-based Model (w/ KB feature)   65.4  53.3  58.8     68.1  58.2  62.8
  EM-based Model (w/o KB feature)  57.0  54.6  55.7     62.9  56.5  59.6
  Fully Supervised Model           63.4  60.3  61.8     65.8  63.2  64.5

179

Summary
Examined unsupervised models for coreference resolution
  self-training, EM, Haghighi and Klein’s model
  require little labeled data, which facilitates their application to resource-scarce languages
The EM-based model and the modified H&K model outperform self-training and H&K’s original model
Not as competitive as the fully-supervised model, but …

180

Summary (Cont’)
… they can potentially be improved by
  incorporating additional linguistic features
    feature engineering remains a challenging issue
  combining a large amount of labeled data with a large amount of unlabeled data

generative modeling is interesting in itself

181

Summary
Examined unsupervised models for coreference resolution
  self-training, EM, Haghighi and Klein’s model
  require little labeled data, which facilitates their application to resource-scarce languages

Self-training, with and without bagging
  doesn’t improve (and sometimes even hurts) performance
  augmenting the labeled data with only confidently-labeled instances means little knowledge is gained by the classifier
  careful feature design is an especially important issue
  need to label both confident and not-so-confident instances

182

Summary (Cont’)
EM-based generative model
  induces a clustering on an unlabeled document
  outperforms Haghighi and Klein’s coreference model
Three extensions to Haghighi and Klein’s generative model
  each modification improves F-measure
Not as competitive as the fully-supervised model, but …
  generative modeling is interesting in itself
  feature engineering remains a crucial yet challenging issue

183

Weakly Supervised Baseline
Train the Naïve Bayes classifier on one (labeled) document
Use the Bell Tree clustering algorithm to impose a partition on each test document using the pairwise probabilities

184

Experimental Setup
The ACE 2003 coreference corpus
  3 data sets (Broadcast News, Newswire, Newspaper)
  each has a training set and a test set
  use one training text for training the Bayes coreference classifier
  evaluate on the entire test set
Mentions extracted automatically using an NP chunker
Scoring program
  MUC scoring program (Vilain et al., 1995)?
  2 problems:
    under-penalizes partitions where mentions are over-clustered
    does not reward successful identification of singleton clusters

186

The Bayes Classifier
finds the class value y (COREF or NOT COREF) that is the most probable given the feature vector x_1, …, x_n

finds y* such that

  y* = argmax_{y ∈ Y} P(y | x_1, x_2, …, x_n)
     = argmax_{y ∈ Y} P(y) P(x_1, x_2, …, x_7 | y)
     = argmax_{y ∈ Y} P(y) P(x_1, x_2, x_3 | y) P(x_4, x_5, x_6 | y) P(x_7 | y)

These are the model parameters (to be estimated from annotated data using maximum likelihood estimation)

Not as naïve as Naïve Bayes …
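The "not so naive" decision rule, with features jointly modeled within each group but groups independent given y, can be sketched as follows; the parameter tables are made-up illustrations, not estimates from the ACE data.

```python
from math import prod  # Python 3.8+

# Toy parameter tables: 7 binary-ish features in 3 groups, modeled jointly
# within a group. All numbers are illustrative.
PRIOR = {True: 0.3, False: 0.7}                # P(y), y = True for COREF
GROUPS = [(0, 1, 2), (3, 4, 5), (6,)]          # indices of the 3 groups
TABLES = [
    {(True, (1, 0, 0)): 0.4, (False, (1, 0, 0)): 0.05,
     (True, (0, 0, 0)): 0.6, (False, (0, 0, 0)): 0.95},
    {(True, (1, 1, 1)): 0.9, (False, (1, 1, 1)): 0.5,
     (True, (0, 1, 1)): 0.1, (False, (0, 1, 1)): 0.5},
    {(True, (1,)): 0.7, (False, (1,)): 0.3,
     (True, (0,)): 0.3, (False, (0,)): 0.7},
]

def classify(x):
    """Return True iff P(y)P(x1..x3|y)P(x4..x6|y)P(x7|y) is larger for COREF."""
    def score(y):
        return PRIOR[y] * prod(
            t[(y, tuple(x[i] for i in g))] for t, g in zip(TABLES, GROUPS))
    return score(True) > score(False)

decision = classify((1, 0, 0, 1, 1, 1, 1))   # e.g. string match + agreement
```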

187

Results (Self-Training w/ and w/o Bagging)

[Two plots, Broadcast News and Newswire: F-measure (y-axis, 37–55) vs. number of iterations (x-axis, 0–9), for self-training w/ bagging (5 bags) and w/o bagging]

188

Self-Training with Bagging

[Diagram: a small labeled data set (L), a large pool of unlabeled data (U), and bagged classifiers h1, h2, …, hk]

Create k training sets, each of size |L|, by sampling from L with replacement
Train k classifiers (bagged classifiers h1, h2, …, hk)
Have each classifier label the unlabeled data
Add to L the N labeled instances with the highest average confidence
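The loop above can be sketched generically; the `train`/`predict` interfaces and the toy sign-based learner below are illustrative assumptions, not the coreference classifier itself.

```python
import random

def self_train_with_bagging(L, U, train, predict, k=5, n=10, iters=3):
    """Self-training with bagging (sketch). Assumed interfaces:
    train(data) -> model, predict(model, x) -> (label, confidence).
    Each round: bag k classifiers from L, label U, and move the n
    instances with the highest average confidence into L."""
    L, U = list(L), list(U)
    for _ in range(iters):
        if not U:
            break
        models = [train(random.choices(L, k=len(L))) for _ in range(k)]
        scored = []
        for x in U:
            preds = [predict(m, x) for m in models]
            label = max(set(p[0] for p in preds),
                        key=lambda l: sum(1 for p in preds if p[0] == l))
            conf = sum(p[1] for p in preds) / k     # average confidence
            scored.append((conf, x, label))
        scored.sort(reverse=True, key=lambda t: t[0])
        for conf, x, label in scored[:n]:
            L.append((x, label))
            U.remove(x)
    return L

random.seed(0)
# toy learner: ignores its training data and labels by sign of x
train = lambda data: None
predict = lambda model, x: (x > 0, min(abs(x) / 10, 1.0))
L0 = [(x, x > 0) for x in (-2.0, -1.0, 1.0, 2.0)]
U0 = [-5.0, 5.0, 0.1]
L1 = self_train_with_bagging(L0, U0, train, predict, k=3, n=2, iters=1)
```

Note how the instances moved into L are exactly the ones the classifier was already most sure about, which is the failure mode the next slide diagnoses.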

193

Why doesn’t Self-Training improve?
  only the most confidently labeled instances are added in each iteration
  the classifier already knows how to label these newly added instances
  not much new knowledge is gained by re-training a classifier on such newly added instances

Need to learn from both the confidently and not-so-confidently labeled instances

194

Haghighi and Klein’s Model
Nonparametric Bayesian model
  enables the use of prior knowledge to put a higher probability on hypotheses deemed more likely
  doesn’t commit to a particular set of parameters (doesn’t attempt to compute the most likely hypothesis)

Given a set of mentions X, find the most likely partition Z, i.e., the Z that maximizes

  P(Z | X) = ∫ P(Z | X, θ) P(θ | X) dθ

  integrate out the parameters θ
  the prior encodes knowledge on hypotheses

199

Bell-Tree Clustering (Luo et al., 2004)
  searches for the most probable partition of a set of mentions
  structures the search space as a Bell tree
  expands only the most promising paths

[Bell tree over 3 mentions: [1] → { [12], [1][2] } → { [123], [12][3], [13][2], [1][23], [1][2][3] }]

How to determine which paths are promising?

202

Determining the Most Promising Paths
Idea: assign a score to each node, based on the pairwise probabilities returned by the coreference classifier

The classifier gives us: Pc(1, 2) = 0.6, Pc(1, 3) = 0.2, Pc(2, 3) = 0.7

  [1] (score 1) expands to [12] (score 0.6) and [1][2] (score 0.4)
  [12] expands to [123] with score 0.6 · max(Pc(1,3), Pc(2,3)) = 0.6 · 0.7 = 0.42,
  and to [12][3] with score 0.6 · (1 - max(Pc(1,3), Pc(2,3))) = 0.6 · 0.3 = 0.18
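One expansion step of this node-scoring heuristic can be sketched as below; linking a cluster is scored by its best pairwise probability with the new mention, and starting a new cluster by one minus the best link overall, following the formula on the slide (note 0.6 · (1 - 0.7) works out to 0.18).

```python
def expand(partition, score, new_mention, p_link):
    """Expand one Bell-tree node: the new mention either joins an existing
    cluster or starts its own. A cluster's link score is the max pairwise
    probability p_link(m, new) over its mentions; starting a new cluster
    scores 1 - max over all clusters."""
    children = []
    link_scores = [max(p_link(m, new_mention) for m in c) for c in partition]
    for i, ls in enumerate(link_scores):
        child = [list(c) for c in partition]
        child[i].append(new_mention)
        children.append((child, score * ls))
    children.append(([list(c) for c in partition] + [[new_mention]],
                     score * (1 - max(link_scores))))
    return children

PC = {(1, 2): 0.6, (1, 3): 0.2, (2, 3): 0.7}
p_link = lambda a, b: PC[tuple(sorted((a, b)))]
kids = expand([[1, 2]], 0.6, 3, p_link)   # expand node [12], score 0.6
```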

204

Plan for the Talk
Supervised learning for coreference resolution
  brief history
  standard machine learning approach
Unsupervised learning for coreference resolution
  self-training and its variant (Ng and Cardie, 2003)
  EM clustering (Ng, 2008)
  nonparametric Bayesian modeling (Haghighi and Klein, 2007)
    three modifications

205

Standard Supervised Learning Approach
Classification
  given a description of two mentions, mi and mj, classify the pair as coreferent or not coreferent
  create one training instance for each pair of mentions from texts annotated with coreference information
    feature vector: describes the two mentions
  train a classifier using a machine learning algorithm
    decision tree learner (C5), maximum entropy, SVMs

  [Queen Elizabeth] set about transforming [her] [husband], ...
      coref? / not coref? / coref?
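Instance creation for this pairwise scheme can be sketched as follows; the featurizer is a placeholder for the real feature extractor.

```python
def make_instances(mentions, chains, featurize):
    """Create one training instance per mention pair, labeled coreferent
    iff the two mentions belong to the same annotated chain. `featurize`
    builds the feature vector describing the pair."""
    chain_of = {m: i for i, chain in enumerate(chains) for m in chain}
    instances = []
    for j in range(1, len(mentions)):
        for i in range(j):
            mi, mj = mentions[i], mentions[j]
            label = chain_of.get(mi) == chain_of.get(mj) and mi in chain_of
            instances.append((featurize(mi, mj), label))
    return instances

mentions = ["Queen Elizabeth", "her", "husband"]
chains = [["Queen Elizabeth", "her"]]   # "husband" is unchained here
data = make_instances(mentions, chains, lambda a, b: (a, b))
```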

206

Related Work
Apply a weakly supervised or unsupervised learning algorithm to pronoun resolution
  co-training (Müller et al., 2002)
  self-training (Kehler et al., 2004)

207

Linguistic Features
Use 7 linguistic features, computed by heuristics and divided into 3 groups

Strong Coreference Indicators
  String match
  Appositive
  Alias (one is an acronym or abbreviation of the other)

Linguistic Constraints
  Gender agreement
  Number agreement
  Semantic compatibility

Mention Type Pairs
  (ti, tj), where ti, tj ∈ { Pronoun, Name, Nominal }

How to compute the semantic class of a mention?
  Proper names: use a named entity recognizer
  Nominals: induced from an unannotated corpus

210

Inducing Semantic Classes
Goal: induce the semantic class of a nominal, focusing on PERSON, ORGANIZATION, LOCATION, and OTHERS

Given a large, unannotated corpus:
  use a parser to extract appositive relations
    <Eastern Airlines, carrier>, <George Bush, president>, …
  use a named entity recognizer to find the semantic classes of the proper names
  infer the semantic class of a nominal from the associated proper name

212

Potential Problems
The named entity recognizer is not perfect
  mislabels proper names
The parser is not perfect
  extracts mention pairs that are not in apposition

To improve robustness:
  1. Compute the probability that the nominal co-occurs with each of the named entity types
  2. If the most likely NE type has a probability above 0.7, label the nominal with the most likely NE type
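The two robustness steps can be sketched as below; the appositive pairs and NE lexicon are toy stand-ins for parser and NE-recognizer output.

```python
from collections import Counter, defaultdict

def induce_semantic_classes(appositive_pairs, ne_type, threshold=0.7):
    """For each nominal, count the NE types of the proper names it appears
    in apposition with; keep the majority type only if its relative
    frequency clears the threshold, otherwise fall back to OTHERS."""
    counts = defaultdict(Counter)
    for name, nominal in appositive_pairs:
        counts[nominal][ne_type(name)] += 1
    labels = {}
    for nominal, c in counts.items():
        best, n = c.most_common(1)[0]
        labels[nominal] = best if n / sum(c.values()) > threshold else "OTHERS"
    return labels

# Toy NE lexicon and parser output (illustrative, including one parser error
# that puts "Paris" in apposition with "president").
NE = {"Eastern Airlines": "ORGANIZATION", "George Bush": "PERSON",
      "Delta": "ORGANIZATION", "Paris": "LOCATION"}
pairs = [("Eastern Airlines", "carrier"), ("Delta", "carrier"),
         ("George Bush", "president"), ("Paris", "president")]
labels = induce_semantic_classes(pairs, NE.get)
```

The noisy "president" evidence falls below the 0.7 threshold, so that nominal is left as OTHERS rather than mislabeled.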

214


Experiments on System Mentions: MUC and CEAF F-Scores

                                 Broadcast News     Newswire
                                 MUC     CEAF       MUC     CEAF
Weakly Supervised Baseline       38.0    49.0       42.8    53.5
Heuristic Baseline               36.4    48.4       43.2    54.2
Our EM-based Model               51.6    55.7       57.8    59.6
Duplicated Haghighi and Klein    45.2    45.2       41.9    48.8
+ Relaxed Head Generation        47.0    47.5       45.0    52.6
+ Agreement Constraints          48.9    51.4       46.0    54.5
+ Pronoun-only Salience          52.6    54.7       50.0    57.4
Fully Supervised Model           60.4    61.8       60.6    64.5

Similar performance trends across the 2 scoring programs

218

Experiments using Perfect Mentions

Perfect mentions are NPs marked up in the answer key; using them makes the coreference task somewhat easier

Similar performance trends observed, except that the unsupervised models perform comparably to the fully-supervised resolver

Conclusions drawn from system mentions are not always generalizable to perfect mentions and vice versa

219

Summary

Presented an EM-based model for unsupervised coreference resolution that:
outperforms Haghighi and Klein’s coreference model
compares favourably to a modified version of their model

220

H&K’s Model: Salience Modeling

Each entity/cluster is initially assigned a salience value of 0
As we process the discourse, the salience value of each entity will change
When we encounter a mention, we update the salience scores (multiply each entity’s score by 0.5, then add 1 to the current entity)
Then discretize the salience values into 5 buckets: TOP, HIGH, MID, LOW, NONE
Using a separate corpus, estimate P(mention type | salience), where mention type can be pronoun, name, or nominal. E.g.,
P(pronoun | TOP) is a large value
P(nominal | TOP) is a small value
The model is sensitive to these estimated values

221
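A minimal sketch of the salience update; the bucket cutoffs are illustrative assumptions, since the slide does not give them:

```python
def update_salience(salience, current_entity, decay=0.5):
    """One salience update from H&K's scheme: decay every entity's
    score by 0.5, then add 1 to the entity of the current mention."""
    for eid in salience:
        salience[eid] *= decay
    salience[current_entity] = salience.get(current_entity, 0.0) + 1.0
    return salience

def bucket(score):
    # Bucket boundaries are NOT given on the slide; these cutoffs
    # are illustrative assumptions only.
    if score >= 1.0:
        return "TOP"
    if score >= 0.5:
        return "HIGH"
    if score >= 0.25:
        return "MID"
    if score > 0.0:
        return "LOW"
    return "NONE"

s = {}
update_salience(s, "e1")   # e1: 1.0
update_salience(s, "e2")   # e1: 0.5, e2: 1.0
print({e: bucket(v) for e, v in s.items()})  # {'e1': 'HIGH', 'e2': 'TOP'}
```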

Why Salience Modeling?

Important for pronouns

For H&K, since they don’t use features like apposition, modeling salience may allow mentions in an appositive to be assigned the same cluster id.

222

Parameter Initialization

= 0.4 (true mentions) and 0.7 (system mentions); concentration parameter: e^-4

223

Parameter Initialization

Uses one (labeled) document taken from the training set to:
initialize the parameters of our EM-based model
determine the concentration parameter, α, in H&K’s model

224

Experiments with Perfect Mentions

Similar performance trends observed, except that the unsupervised models perform comparably to the fully-supervised resolver

Conclusions drawn from perfect mentions are not always generalizable to system mentions and vice versa

Results obtained using perfect mentions should not be compared against those obtained using system mentions

225

Degenerate EM Baseline

Model obtained after one iteration of EM

No parameter re-estimation on the unlabeled data

226


Degenerate EM Baseline: MUC Results (Experiments on System Mentions)

                               Broadcast News        Newswire
                               R     P     F         R     P     F
Heuristic Baseline             30.9  44.3  36.4      36.3  53.4  43.2
Degenerate EM Baseline         70.8  36.3  48.0      69.0  25.1  36.8
Our EM-based Model             42.4  66.0  51.6      55.2  60.6  57.8
Haghighi and Klein Baseline    50.8  40.7  45.2      43.0  40.9  41.9
+ Relaxed Head Generation      48.3  45.7  47.0      40.9  50.0  45.0
+ Agreement Constraints        50.4  47.5  48.9      41.7  51.2  46.0
+ Pronoun-only Salience        52.2  53.0  52.6      44.3  57.3  50.0
Fully Supervised Model         53.0  70.3  60.4      53.1  70.5  60.6

Large gain in recall and large drop in precision (over-clustering)
F-score increases for one data set and drops for the other

228

EM-Based Model: MUC Results

In comparison to Degenerate EM:
large drop in recall, but larger gain in precision
F-score increases by 4-21%
gains attributed to the exploitation of unlabeled data

                               Broadcast News        Newswire
                               R     P     F         R     P     F
Heuristic Baseline             30.9  44.3  36.4      36.3  53.4  43.2
Our EM-based Model             42.4  66.0  51.6      55.2  60.6  57.8
Haghighi and Klein Baseline    50.8  40.7  45.2      43.0  40.9  41.9
+ Relaxed Head Generation      48.3  45.7  47.0      40.9  50.0  45.0
+ Agreement Constraints        50.4  47.5  48.9      41.7  51.2  46.0
+ Pronoun-only Salience        52.2  53.0  52.6      44.3  57.3  50.0
Fully Supervised Model         53.0  70.3  60.4      53.1  70.5  60.6

229

Experiments on System Mentions: MUC, CEAF, CEAF-Variant F-Scores

                               Broadcast News            Newswire
                               MUC   CEAF  CEAFV         MUC   CEAF  CEAFV
Heuristic Baseline             36.4  48.4  46.3          43.2  54.2  50.3
Degenerate EM Baseline         48.0  39.4  35.8          36.8  27.9  26.3
Our EM-based Model             51.6  55.7  52.9          57.8  59.6  52.8
Haghighi and Klein Baseline    45.2  45.2  39.0          41.9  48.8  41.7
+ Relaxed Head Generation      47.0  47.5  42.3          45.0  52.6  46.3
+ Agreement Constraints        48.9  51.4  47.0          46.0  54.5  48.4
+ Pronoun-only Salience        52.6  54.7  51.1          50.0  57.4  51.2
Fully Supervised Model         60.4  61.8  59.9          60.6  64.5  60.6

230

Observations (slides 231-235 each repeat the table above with one takeaway added):
Degenerate EM Baseline performs the worst
EM-based Model outperforms Heuristic Baseline
Addition of each extension yields improvements in F-score
Extended H&K system performs comparably with the EM-based model
Unsupervised models lag the performance of the supervised model

235

Unsupervised Coreference as EM Clustering

Design a generative model that can be used to induce a clustering of the mentions in a given document

Exploit pairwise linguistic constraints: gender and number agreement, semantic compatibility, …

236

Representing a Clustering

A clustering C of n mentions is an n x n Boolean matrix, where Cij = 1 iff mentions i and j are coreferent

Facilitates the incorporation of pairwise linguistic constraints

[figure: two 5 x 5 Boolean matrices, one showing a valid clustering and one an invalid clustering]

237
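A valid matrix must encode an equivalence relation over the mentions. A minimal sketch of the validity check (names and representation are my own):

```python
def is_valid_clustering(C):
    """Check that a Boolean coreference matrix encodes a valid
    clustering, i.e. an equivalence relation over the mentions:
    reflexive, symmetric, and transitive."""
    n = len(C)
    for i in range(n):
        if not C[i][i]:                      # reflexivity
            return False
        for j in range(n):
            if C[i][j] != C[j][i]:           # symmetry
                return False
            for k in range(n):
                if C[i][j] and C[j][k] and not C[i][k]:
                    return False             # transitivity
    return True

# mentions 1 and 2 coreferent, mention 3 alone: valid
valid = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
# 1~2 and 2~3 but not 1~3: violates transitivity
invalid = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
print(is_valid_clustering(valid), is_valid_clustering(invalid))  # True False
```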

Strong Coreference Indicators

String match Alias (one is an acronym or abbreviation of the other) Appositive

Linguistic Constraints

Gender agreement Number agreement Semantic compatibility

Mention Type Pairs (ti, tj), where ti, tj ∈ { Pronoun, Proper, Common }

Use 7 linguistic features

Features

238


Computing the E-step

Goal: assign a probability to each possible clustering of the mentions in a document

Computationally intractable: number of clusterings is exponential in the number of mentions

Search for the N most probable clusterings only, using Luo et al.’s (2004) search algorithm, which structures the search space as a Bell tree

242

A Bell Tree

[figure: the Bell tree for 3 mentions — root [1]; children [12] and [1][2]; leaves [123], [12][3], [13][2], [1][23], [1][2][3]]

243

The Bell-Tree Search Algorithm

Finds the N most probable paths from the root to a leaf using a beam search

The probability of a clustering (or partition) is the probability assigned to the corresponding path

244
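A hypothetical sketch of the beam search over the Bell tree: each partial partition is extended by either joining the new mention to an existing cluster or starting a new one, keeping only the N best. The link scoring (min/max of assumed pairwise probabilities) is a simplification of Luo et al.'s actual model:

```python
import heapq

def beam_search_partitions(n_mentions, pair_prob, beam_size=5):
    """Beam search over the Bell tree. pair_prob(i, j) is an assumed
    pairwise coreference probability; a clustering's score multiplies
    the probabilities of its linking decisions (a simplification)."""
    beam = [(1.0, [[0]])]                        # (score, partition of mention 0)
    for m in range(1, n_mentions):
        candidates = []
        for score, part in beam:
            for c, cluster in enumerate(part):   # join existing cluster c
                p = min(pair_prob(i, m) for i in cluster)
                new = [list(cl) for cl in part]
                new[c].append(m)
                candidates.append((score * p, new))
            # start a new cluster: m links to nothing so far
            p_new = 1.0
            for cl in part:
                p_new *= 1 - max(pair_prob(i, m) for i in cl)
            candidates.append((score * p_new, part + [[m]]))
        beam = heapq.nlargest(beam_size, candidates, key=lambda x: x[0])
    return beam

# toy probabilities: only mentions 0 and 1 look coreferent
prob = lambda i, j: 0.8 if (i, j) == (0, 1) else 0.1
best = beam_search_partitions(3, prob, beam_size=3)
print(best[0][1])   # [[0, 1], [2]]
```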

Degenerate EM Baseline

Model that is obtained after one iteration of EM:
initializes model parameters based on a labeled document
applies the model (and Bell tree search) to obtain the most probable coreference partition

no parameter re-estimation on the unlabeled data

245

Noun Phrase Coreference

Identify the noun phrases (or mentions) that refer to the same real-world entity

Partition the set of mentions into coreference equivalence classes

Queen Elizabeth set about transforming her husband, King George VI, into a viable monarch. A renowned speech therapist was summoned to help the King overcome his speech impediment...

246

Supervised Coreference Resolution

Lots of prior work on supervised coreference resolution: Soon et al. (2001), Strube et al. (2002), Yang et al. (2003), Luo et al. (2004), Denis and Baldridge (2007), …

247

Representing a Clustering

A clustering C of n mentions is an n x n Boolean matrix, where Cij = 1 iff mentions i and j are coreferent

Reflexivity: Cii = 1 for every mention i

[figure: 5 x 5 matrix with the diagonal cells highlighted]

248

Approximating the E-step

Search for the N most probable clusterings only, using Luo et al.’s (2004) search algorithm:
structures the search space as a Bell tree
takes as input the pairwise coreference probabilities
scores a clustering based on these probabilities

249

Haghighi and Klein’s Model

Cluster-level model:
assigns a cluster id to each mention
ensures transitivity automatically

Nonparametric Bayesian model:
does not commit to a particular set of parameters

250

Model Parameters

P(mp1, mp2, mp3 | c)
P(mp4, mp5, mp6 | c)
P(mp7 | c)

mpi are the feature values
c ∈ { COREF, NOT COREF }

251

Experimental Setup

The ACE 2003 coreference corpus:
3 data sets (Broadcast News, Newswire, Newspaper)
each has a training set and a test set; evaluate on the test set only

Mentions:
system mentions (mentions extracted by an NP chunker)
perfect mentions (mentions extracted from the answer key)

Scoring programs (recall, precision, F-measure):
MUC scoring program (Vilain et al., 1995)
CEAF scoring program (Luo, 2005)
CEAF variant: same as CEAF, but ignores singleton clusters

252


Features

Use 7 linguistic features divided into 3 groups:

Strong Coreference Indicators: string match, appositive, alias (one is an acronym or abbreviation of the other)

Linguistic Constraints: gender agreement, number agreement, semantic compatibility

Mention Type Pairs (ti, tj), where ti, tj ∈ { Pronoun, Name, Nominal }

254


The Generative Model

Given a document D,
generate a clustering C according to P(C)
generate D given C

P(D, C) = P(C) P(D | C)
        = P(C) P(mp12, mp13, mp14, … | C)
        = P(C) ∏_{ij ∈ Pairs(D)} P(mp_ij | C_ij)
        = P(C) ∏_{ij ∈ Pairs(D)} P(mp_ij,1, mp_ij,2, …, mp_ij,7 | C_ij)
        = P(C) ∏_{ij ∈ Pairs(D)} P(mp_ij,1, mp_ij,2, mp_ij,3 | C_ij) P(mp_ij,4, mp_ij,5, mp_ij,6 | C_ij) P(mp_ij,7 | C_ij)

257
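The factored probability can be sketched in code; all names and the toy conditional probabilities below are assumptions for illustration, not the paper's implementation:

```python
from itertools import combinations

def joint_probability(clustering, features, p_c, cond):
    """Sketch of P(D, C) = P(C) * prod over mention pairs of
    P(mp_ij | C_ij), with the 7 features factored into the three
    conditionally independent groups from the slides.
    clustering[i][j] -> 1 if mentions i, j are coreferent
    features[(i, j)] -> tuple of 7 feature values for the pair
    p_c              -> P(C), the prior probability of the clustering
    cond[(group, c)] -> P(feature group | c), estimated elsewhere
    """
    n = len(clustering)
    total = p_c
    for i, j in combinations(range(n), 2):
        c = clustering[i][j]
        f = features[(i, j)]
        total *= (cond[(f[0:3], c)]      # strong coreference indicators
                  * cond[(f[3:6], c)]    # linguistic constraints
                  * cond[(f[6:7], c)])   # mention type pair
    return total

# toy example with two coreferent mentions and made-up probabilities
clustering = [[1, 1], [1, 1]]
features = {(0, 1): ("Y", "N", "N", "Y", "Y", "Y", ("Name", "Name"))}
cond = {(("Y", "N", "N"), 1): 0.4,
        (("Y", "Y", "Y"), 1): 0.5,
        ((("Name", "Name"),), 1): 0.6}
print(round(joint_probability(clustering, features, 0.5, cond), 4))  # 0.06
```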


The Induction Algorithm

Given a set of unlabeled documents:
guess a clustering for each document according to P(C)
estimate the model parameters based on the automatically labeled documents (M-step): maximum likelihood estimation
assign a probability to each possible clustering of the mentions in each document (E-step)

3 mentions: 1, 2, 3
[123]  [12][3]  [13][2]  [1][23]  [1][2][3]
0.23   0.32     0.11     0.29     0.05

Iterate till convergence

How to cope with the computational complexity of the E-step?

263
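The loop above can be sketched as a skeleton, with e_step and m_step as stand-ins for the model-specific computations described on the slides:

```python
def em_induction(documents, init_params, e_step, m_step, max_iters=10):
    """Skeleton of the induction loop: alternate the E-step (score the
    N best clusterings of each document under the current parameters)
    and the M-step (maximum likelihood re-estimation from those
    weighted clusterings), until the parameters stop changing."""
    params = init_params
    for _ in range(max_iters):
        # E-step: per document, the N most probable clusterings and
        # their probabilities (found with the Bell-tree beam search)
        posteriors = [e_step(doc, params) for doc in documents]
        # M-step: maximum likelihood re-estimation from the soft labels
        new_params = m_step(documents, posteriors)
        if new_params == params:   # converged
            break
        params = new_params
    return params

# trivial stand-ins just to exercise the loop
docs = ["doc1", "doc2"]
toy_e = lambda doc, p: {"[123]": 0.6, "[12][3]": 0.4}
toy_m = lambda ds, post: "theta*"
print(em_induction(docs, "theta0", toy_e, toy_m))   # theta*
```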

Goals

Design a new model for unsupervised coreference resolution

Improve Haghighi and Klein’s model with three modifications

264


Evaluation Results

Broadcast News: Recall: 53.1, Precision: 45.5, F-measure: 49.0
Newswire: Recall: 57.2, Precision: 50.3, F-measure: 53.5

Can we improve performance by combining labeled and unlabeled data?

266

Haghighi and Klein’s Generative Story

For each mention encountered in a document,
generate a cluster id for the mention (according to some cluster id distribution)
generate the head noun of the mention (according to some cluster-specific head distribution)

EM-based Generative Model:
Create mention pairs
For each pair, guess whether it is COREF or NOT COREF according to P(COREF)
Generate feature values

H&K’s Generative Model:
For each mention, guess the cluster id according to P(cluster id)
Generate feature values

267


Dirichlet Process

Generate new cluster ids as needed

Probability of generating some existing cluster id i:
n_i / (N - 1 + α), where n_i is the number of mentions already in cluster i
higher probability for larger clusters

Probability of generating some new cluster id:
α / (N - 1 + α), for some constant α

269
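The generation rule can be sketched as a Chinese-restaurant-process draw; the function name is hypothetical and the normalization by (N - 1 + α) is the standard CRP formulation assumed here:

```python
import random

def sample_cluster_id(cluster_sizes, alpha):
    """One draw from the Chinese restaurant process behind the slide:
    an existing cluster id i is chosen with probability n_i / (N - 1 + alpha)
    and a brand-new id with probability alpha / (N - 1 + alpha), where the
    n_i are the sizes of clusters built from the N - 1 previous mentions."""
    total_seen = sum(cluster_sizes.values())       # N - 1
    r = random.random() * (total_seen + alpha)
    for cid, size in cluster_sizes.items():
        if r < size:
            return cid                             # join existing cluster
        r -= size
    return max(cluster_sizes, default=0) + 1       # open a new cluster

random.seed(0)
sizes = {1: 2, 2: 2}     # four mentions seen so far, two clusters of size 2
draws = [sample_cluster_id(sizes, alpha=1.0) for _ in range(1000)]
# empirically, id 3 (a new cluster) appears about alpha/(4+alpha) = 20% of the time
print(round(draws.count(3) / 1000, 2))
```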


The CEAF Scoring Program

Input: correct partition, system partition
Correct partition: {3}, {4, 7}, {2, 5, 8}, {6}, {1, 9}
System partition: {6, 11, 12}, {2, 7, 8}, {1, 4, 9}, {3, 5, 10}

Recast the scoring problem as bipartite matching
Find the best matching using the Hungarian Algorithm

Best-matching cluster overlaps: 2, 2, 1, 1
Matching score = 6
Recall = 6 / 9 = 0.66
Prec = 6 / 12 = 0.5
F-measure = 0.57

274
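The example can be checked in code. A brute-force matcher stands in for the Hungarian Algorithm (equivalent at this size); only the partitions come from the slide:

```python
from itertools import permutations

def ceaf(gold, system):
    """CEAF on the slide's example: find the one-to-one matching between
    gold and system clusters maximizing total overlap (brute force over
    permutations here; Luo (2005) uses the Hungarian Algorithm), then
    score recall/precision against the mention totals."""
    big, small = (gold, system) if len(gold) >= len(system) else (system, gold)
    best = 0
    for perm in permutations(range(len(big)), len(small)):
        best = max(best, sum(len(small[i] & big[perm[i]])
                             for i in range(len(small))))
    recall = best / sum(len(c) for c in gold)
    precision = best / sum(len(c) for c in system)
    f = 2 * recall * precision / (recall + precision)
    return best, recall, precision, f

gold = [{3}, {4, 7}, {2, 5, 8}, {6}, {1, 9}]
system = [{6, 11, 12}, {2, 7, 8}, {1, 4, 9}, {3, 5, 10}]
score, r, p, f = ceaf(gold, system)
print(score, round(r, 2), round(p, 2), round(f, 2))   # 6 0.67 0.5 0.57
```

(The slide truncates 6/9 to 0.66; `round` gives 0.67.)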


Standard Supervised Learning Approach

Classification: given a description of two mentions, mi and mj, classify the pair as coreferent or not coreferent

Create one training instance for each pair of mentions from a training text; feature vector: describes the two mentions

[Queen Elizabeth] set about transforming [her] [husband], ...

coref ?

not coref ?

coref ?

276
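Instance creation can be sketched as below; the feature extractor is a placeholder, and the gold cluster ids are assumed given:

```python
from itertools import combinations

def make_training_instances(mentions, gold_cluster_of, extract_features):
    """Create one training instance per mention pair, labeled COREF
    when the two mentions share a gold cluster. extract_features is a
    stand-in for the feature vector describing the pair."""
    instances = []
    for mi, mj in combinations(mentions, 2):
        label = ("COREF" if gold_cluster_of[mi] == gold_cluster_of[mj]
                 else "NOT COREF")
        instances.append((extract_features(mi, mj), label))
    return instances

mentions = ["Queen Elizabeth", "her", "husband"]
gold = {"Queen Elizabeth": 1, "her": 1, "husband": 2}
feats = lambda a, b: (a, b)   # placeholder features
for f, y in make_training_instances(mentions, gold, feats):
    print(f, y)
# ('Queen Elizabeth', 'her') COREF
# ('Queen Elizabeth', 'husband') NOT COREF
# ('her', 'husband') NOT COREF
```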


Haghighi and Klein’s Generative Story

For each mention encountered in a document,
generate a cluster id for the mention (according to some cluster id distribution)
generate the head noun of the mention (according to some cluster-specific head distribution)

The probability of generating a particular cluster id is based on some distribution that specifies P(id=1), P(id=2), P(id=3), …
but we don’t know the number of clusters a priori
don’t know how many probabilities to specify
need a distribution over an unknown number of clusters

278


Dirichlet Process

Generate new cluster ids as needed

Queen Elizabeth set about transforming her husband,

King George VI, into a viable monarch. Logue, a

renowned speech therapist, was summoned to help the

King overcome his speech impediment...

(cluster ids assigned so far: Queen Elizabeth → 1, her → 1, husband → 2, King George VI → 2, Logue → ?)

Should we generate id 1 or 2, or should we generate a new id 3?

281


Dirichlet Process

Generate new cluster ids as needed

Probability of generating some existing cluster id i is proportional to the number of mentions already in cluster i
higher probability for larger clusters

Probability of generating some new cluster id is proportional to some constant α
