Unsupervised Models for Coreference Resolution


Unsupervised Models for Coreference Resolution

Vincent Ng
Human Language Technology Research Institute
University of Texas at Dallas


Plan for the Talk

Supervised learning for coreference resolution
- how and when supervised coreference research started
- standard machine learning approach

Unsupervised learning for coreference resolution
- self-training
- EM clustering (Ng, 2008)
- nonparametric Bayesian modeling (Haghighi and Klein, 2007)
  - three modifications


Machine Learning for Coreference Resolution
- started in mid-1990s
  - Connolly et al. (1994), Aone and Bennett (1995), McCarthy and Lehnert (1995)
- propelled by the availability of annotated corpora produced by
  - Message Understanding Conferences (MUC-6/7: 1995, 1998): English only
  - Automatic Content Extraction (ACE 2003, 2004, 2005, 2008): English, Chinese, Arabic
- identified as an important task for information extraction
- identity coreference only


Identity Coreference
Identify the noun phrases (or mentions) that refer to the same real-world entity:

  Queen Elizabeth set about transforming her husband, King George VI, into a
  viable monarch. Logue, a renowned speech therapist, was summoned to help the
  King overcome his speech impediment...

Lots of prior work on supervised coreference resolution


Standard Supervised Learning Approach: Classification
- a classifier is trained to determine whether two mentions are coreferent or not coreferent

  [Queen Elizabeth] set about transforming [her] [husband], ...

[Diagram: for each pair among the bracketed mentions, the classifier is asked: coref or not coref?]

Standard Supervised Learning Approach: Clustering
- coordinates possibly contradictory pairwise classification decisions

[Diagram: pairwise decisions over [Queen Elizabeth], [her], [husband], ... (coref, not coref,
not coref) are fed into a clustering algorithm, which outputs entity clusters such as
{Queen Elizabeth, her}, {husband, King George VI, the King, his}, and
{Logue, a renowned speech therapist}.]
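One minimal way to coordinate pairwise decisions (a sketch, not necessarily the algorithm used in the talk) is single-link clustering: take every pair the classifier labels coref and compute connected components, for example with union-find:

```python
def cluster_mentions(mentions, coref_pairs):
    """Single-link clustering: group mentions connected by 'coref' decisions.

    mentions: list of mention strings
    coref_pairs: iterable of (i, j) index pairs the classifier labeled coref
    """
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in coref_pairs:
        parent[find(i)] = find(j)  # union the two clusters

    clusters = {}
    for idx, m in enumerate(mentions):
        clusters.setdefault(find(idx), []).append(m)
    return list(clusters.values())

mentions = ["Queen Elizabeth", "her", "husband", "King George VI"]
# suppose the classifier says (0, 1) is coref and (2, 3) is coref
print(cluster_mentions(mentions, [(0, 1), (2, 3)]))
```

Note that single-link transitively merges clusters, so one confident "coref" link can override several "not coref" decisions; this is exactly the kind of contradiction the clustering step must resolve.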

Standard Supervised Learning Approach
- typically relies on a large amount of labeled data
- What if we only have a small amount of annotated data?


First Attempt: Supervised Learning
- train on whatever annotated data we have
- need to specify:
  - learning algorithm (Bayes)
  - feature set
  - clustering algorithm (Bell tree)

The Bayes Classifier
- finds the class value y (Coref or Not Coref) that is most probable given the feature vector x1, ..., xn
- finds y* such that

  y* = argmax_{y in Y} P(y | x1, x2, ..., xn)
     = argmax_{y in Y} P(y) P(x1, x2, ..., xn | y)

What features to use in the feature representation?

Linguistic Features
Use 7 linguistic features divided into 3 groups:

Strong Coreference Indicators
- String match
- Appositive
- Alias (one is an acronym or abbreviation of the other)

Linguistic Constraints
- Gender agreement
- Number agreement
- Semantic compatibility

Mention Pair Type
- (ti, tj), where ti, tj ∈ { Pronoun, Name, Nominal }
- E.g., for the mention pair (Barack Obama, president-elect), the feature value is (Name, Nominal)
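Features of these kinds can be sketched as simple string and attribute checks. The sketch below is only illustrative: real systems use parsers, gazetteers, and lexical resources for appositives, gender, and semantic compatibility, so the stubbed-out values and helper names here are my own:

```python
def mention_type(m):
    # crude heuristic: pronouns from a closed list, capitalized strings as names
    pronouns = {"he", "she", "it", "they", "his", "her", "him", "them", "its"}
    if m.lower() in pronouns:
        return "Pronoun"
    if m[0].isupper():
        return "Name"
    return "Nominal"

def pair_features(mi, mj):
    """Approximate the 7 features for a mention pair (illustrative only)."""
    return {
        "string_match": mi.lower() == mj.lower(),
        "appositive": False,          # needs syntax; stubbed out here
        "alias": mi.replace(".", "") == "".join(w[0] for w in mj.split()),
        "gender_agree": True,         # would come from a gender lexicon
        "number_agree": not (mi.endswith("s") ^ mj.endswith("s")),
        "semantic_compat": True,      # would come from, e.g., WordNet classes
        "mention_pair_type": (mention_type(mi), mention_type(mj)),
    }

print(pair_features("Barack Obama", "president-elect")["mention_pair_type"])
```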

The Bayes Classifier
- finds the class value y (COREF or NOT COREF) that is most probable given the feature vector x1, ..., x7
- finds y* such that

  y* = argmax_{y in Y} P(y | x1, x2, ..., x7)
     = argmax_{y in Y} P(y) P(x1, x2, ..., x7 | y)

- But we may have a data sparseness problem. Let's simplify this term!
- Assume that feature values from different groups are independent of each other given the class:

  y* = argmax_{y in Y} P(y) P(x1, x2, x3 | y) P(x4, x5, x6 | y) P(x7 | y)

- These are the model parameters (to be estimated from annotated data using maximum likelihood estimation)
- Generative model: specifies how an instance is generated
  - Generate the class y with P(y)
  - Given y, generate x1, x2, and x3 with P(x1, x2, x3 | y)
  - Given y, generate x4, x5, and x6 with P(x4, x5, x6 | y)
  - Given y, generate x7 with P(x7 | y)
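The grouped factorization can be sketched directly. This is a minimal illustration, not the talk's implementation: feature values are opaque tuples, the training pairs are hypothetical, and the smoothing scheme is my own simplification:

```python
from collections import Counter, defaultdict

GROUPS = [(0, 1, 2), (3, 4, 5), (6,)]  # indicator, constraint, pair-type groups

def train(pairs):
    """MLE counts for P(y) and each grouped P(x_group | y).

    pairs: list of (x, y) where x is a 7-tuple of feature values.
    """
    class_counts = Counter(y for _, y in pairs)
    group_counts = defaultdict(Counter)
    for x, y in pairs:
        for g in GROUPS:
            group_counts[y][(g, tuple(x[i] for i in g))] += 1
    return class_counts, group_counts

def classify(x, class_counts, group_counts, alpha=0.1):
    """Pick argmax_y P(y) * prod_g P(x_g | y), with light smoothing."""
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for y, cy in class_counts.items():
        score = cy / total  # P(y)
        for g in GROUPS:
            score *= (group_counts[y][(g, tuple(x[i] for i in g))] + alpha) / (cy + alpha)
        if score > best_score:
            best, best_score = y, score
    return best
```

Because each group is estimated jointly, the model keeps within-group correlations (e.g., between string match and alias) while still avoiding a full 7-dimensional table.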

First Attempt: Supervised Learning
- train on whatever annotated data we have
- need to specify: learning algorithm, feature set, clustering algorithm

Bell-Tree Clustering (Luo et al., 2004)
- searches for the most probable partition of a set of mentions
- structures the search space as a Bell tree:

  [1]
  ├─ [12]
  │   ├─ [123]
  │   └─ [12][3]
  └─ [1][2]
      ├─ [13][2]
      ├─ [1][23]
      └─ [1][2][3]

- Leaves contain all the possible partitions of all of the mentions
- It is computationally infeasible to expand all the nodes in the Bell tree, so the algorithm expands only the most promising nodes
- How to determine which nodes are promising?

Determining the Most Promising Paths
Idea: assign a score to each node, based on the pairwise probabilities returned by the coreference classifier.

Classifier gives us: Pc(1, 2) = 0.6, Pc(1, 3) = 0.2, Pc(2, 3) = 0.7

- the root [1] has score 1
- [12]: 1 * Pc(1,2) = 1 * 0.6 = 0.6
- [1][2]: 1 * (1 - Pc(1,2)) = 1 * (1 - 0.6) = 0.4
- [123]: 0.6 * max(Pc(1,3), Pc(2,3)) = 0.6 * max(0.2, 0.7) = 0.42
- [12][3]: 0.58
- [13][2]: 0.08
- [1][23]: 0.28
- [1][2][3]: 0.12

The algorithm expands only the N most probable nodes at each level.
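The search can be sketched as a beam search over partial partitions. This is an illustration, not Luo et al.'s exact scoring: merging a mention into a cluster multiplies by the best link probability (as in the worked [123] example), starting a new cluster multiplies by one minus the best available link (as in the [1][2] example), and `PC` holds the hypothetical pairwise probabilities from the slide:

```python
from itertools import chain

PC = {(1, 2): 0.6, (1, 3): 0.2, (2, 3): 0.7}  # pairwise coref probabilities

def pc(i, j):
    return PC[(min(i, j), max(i, j))]

def bell_tree_beam(mentions, beam_width=2):
    """Beam search over partitions: each step places the next mention either
    into an existing cluster or into a new singleton cluster."""
    beam = [([[mentions[0]]], 1.0)]  # (partition, score)
    for m in mentions[1:]:
        candidates = []
        for partition, score in beam:
            for k, cluster in enumerate(partition):
                # merge m into cluster k: score *= best link probability
                new_p = [c[:] for c in partition]
                new_p[k].append(m)
                candidates.append((new_p, score * max(pc(m, x) for x in cluster)))
            # start a new cluster: score *= (1 - best link to any old mention)
            best_link = max(pc(m, x) for x in chain.from_iterable(partition))
            candidates.append((partition + [[m]], score * (1 - best_link)))
        beam = sorted(candidates, key=lambda t: -t[1])[:beam_width]
    return beam[0]

print(bell_tree_beam([1, 2, 3]))
```

With the probabilities above, the best partition found is [123] with score 0.42, matching the worked example.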

Where are we?
We have described:
- a learning algorithm for training a coreference classifier
- a clustering algorithm for combining coreference probabilities

Goal: evaluate this coreference system in the presence of a small amount of labeled data

Experimental Setup

The ACE 2003 coreference corpus
- 3 data sets (Broadcast News, Newswire, Newspaper)
- each has a training set and a test set
- use one training text for training the coreference classifier
- evaluate on the entire test set

Mentions extracted automatically using an NP chunker

Scoring program
- CEAF scoring program (Luo, 2005)
- recall, precision, F-measure

Evaluation Results (Experiments on System Mentions)

                                   Broadcast News        Newswire
                                   R     P     F        R     P     F
  Weakly Supervised Baseline      53.1  45.5  49.0     57.2  50.3  53.5
  Heuristic Baseline              54.3  43.7  48.4     58.9  50.2  54.2
  Our EM-based Model              57.0  54.6  55.7     62.9  56.5  59.6
  Duplicated Haghighi and Klein   53.2  39.3  45.2     54.5  44.2  48.8
   + Relaxed Head Generation      53.4  42.8  47.5     55.9  49.8  52.6
   + Agreement Constraints        57.8  46.3  51.4     57.9  51.5  54.5
   + Pronoun-only Salience        59.2  50.8  54.7     59.4  55.6  57.4
  Fully Supervised Model          63.4  60.3  61.8     65.8  63.2  64.5

Can we improve performance by combining a small amount of labeled data and a potentially large amount of unlabeled data?

Plan for the Talk

Supervised learning for coreference resolution
- brief history
- standard machine learning approach

Unsupervised learning for coreference resolution
- self-training
- EM clustering (Ng, 2008)
- nonparametric Bayesian modeling (Haghighi and Klein, 2007)
  - three modifications

Self-Training

[Diagram: a small labeled data set (L) and a large unlabeled data set (U); a classifier h is trained on L, applied to U, and the N most confidently labeled instances are added to L before retraining.]
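The loop the diagram depicts can be sketched generically. This is a minimal version assuming a classifier object with scikit-learn-style `fit`/`predict_proba` methods; the function and parameter names are my own:

```python
import numpy as np

def self_train(model, X_lab, y_lab, X_unlab, n_per_iter=10, iterations=5):
    """Generic self-training: repeatedly move the most confidently labeled
    unlabeled instances into the labeled set and retrain.

    model: any classifier with fit(X, y) and predict_proba(X).
    """
    X_lab, y_lab = list(X_lab), list(y_lab)
    X_unlab = list(X_unlab)
    for _ in range(iterations):
        if not X_unlab:
            break
        model.fit(np.array(X_lab), np.array(y_lab))
        probs = model.predict_proba(np.array(X_unlab))
        conf = probs.max(axis=1)                # confidence of the best class
        picks = np.argsort(-conf)[:n_per_iter]  # the N most confident instances
        for i in sorted(picks, reverse=True):   # move them from U to L
            X_lab.append(X_unlab.pop(i))
            y_lab.append(int(probs[i].argmax()))
    model.fit(np.array(X_lab), np.array(y_lab))
    return model
```

Note how the loop only ever adds instances the current model is already sure about, which is exactly the weakness the following slides diagnose.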

Results (F-measure for Self-Training)

[Two plots: F-measure (y-axis, 43 to 55) against number of self-training iterations (x-axis, 0 to 9), without bagging, for Broadcast News and Newswire.]

Why doesn't Self-Training improve?
- only the most confidently labeled instances are added in each iteration
- the classifier already knows how to label these newly added instances
- not much new knowledge is gained by re-training a classifier from such newly added instances

Why does Self-Training hurt?
- also due to the bias towards confidently labeled instances
- many confidently labeled instances are pairs of identical proper names:

  (India, India)  (prince, prince)  (IBM, IBM)  (Clinton, Clinton): all labeled Coref
  Mention Pair Type feature value: (Name, Name) for every pair

- the classifier gradually learns that two proper names are likely to be coreferent, regardless of whether the names are identical

Why does Self-Training hurt?
Since we hypothesize that the Mention Pair Type feature is causing the problem, repeat the experiments without using this feature.

Results (F-measure for Self-Training)

[Two plots: F-measure (y-axis, 43 to 55) against number of self-training iterations (x-axis, 0 to 9) for Broadcast News and Newswire, with and without the Mention Pair Type feature.]

Some Lessons Learned
- when labeled data is scarce, feature design becomes an important issue
- when exploiting unlabeled data, it is crucial to learn from both confidently labeled and not-so-confidently labeled data

Plan for the Talk

Supervised learning for coreference resolution
- brief history
- standard machine learning approach

Unsupervised learning for coreference resolution
- self-training
- EM clustering (Ng, 2008)
- nonparametric Bayesian modeling (Haghighi and Klein, 2007)
  - three modifications

Unsupervised Coreference as EM Clustering
- exploits unlabeled data by inducing a clustering for an unlabeled document, not by labeling mention pairs
- the EM-based model is forced to learn from all of the mention pairs when the model is retrained

Representing a Clustering
A clustering C of n mentions is an n x n Boolean matrix, where Cij = 1 iff mentions i and j are coreferent.

[Diagram: a 5 x 5 matrix with coreferent entries set to 1 and non-coreferent entries set to 0.]

- Don't care about diagonal entries (a mention is trivially coreferent with itself)
- Don't care about entries below the diagonal (the matrix is symmetric)
- Coreference is transitive, so not every Boolean matrix is a valid clustering: if Cij = 1 and Cjk = 1, then Cik must also be 1
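The validity condition can be checked directly; a small sketch (the index convention and helper names are my own):

```python
def is_valid_clustering(C):
    """Check that an upper-triangular Boolean matrix respects transitivity:
    if i~j and j~k are coreferent, then i~k must be too."""
    n = len(C)

    def link(a, b):
        a, b = min(a, b), max(a, b)
        return C[a][b] == 1

    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                links = [link(i, j), link(j, k), link(i, k)]
                if sum(links) == 2:  # exactly two links violates transitivity
                    return False
    return True

valid = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]    # mentions 1, 2, 3 all coreferent
invalid = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]  # 1~2 and 2~3, but not 1~3
print(is_valid_clustering(valid), is_valid_clustering(invalid))
```

A triple of mentions violates transitivity exactly when two of its three links hold but the third does not; zero, one, or three links are all consistent.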

The Generative Model
Given a document D,
- generate a clustering C according to P(C)
- generate D given C

  P(D, C) = P(C) P(D | C)

How to generate D given C?
- Assume that D is represented by its mention pairs
- To generate D, generate all pairs of mentions in D:
  (Queen Elizabeth, her), (Queen Elizabeth, husband), (Queen Elizabeth, King George VI), ...

  P(D, C) = P(C) P(mp12, mp13, mp14, ... | C)

  where mpij is the pair formed from mention i and mention j

Let's simplify this term: assume that each mention pair mpij is generated conditionally independently given Cij:

  P(D, C) = P(C) ∏_{(i,j) ∈ Pairs(D)} P(mpij | Cij)

How to represent a mention pair mpij? Each mpij is represented by its 7 feature values (the same 7 linguistic features, in 3 groups, as before):

  P(D, C) = P(C) ∏_{(i,j) ∈ Pairs(D)} P(mpij,1, mpij,2, ..., mpij,7 | Cij)

Let's simplify this term: assume that feature values from different groups are conditionally independent of each other:

  P(D, C) = P(C) ∏_{(i,j) ∈ Pairs(D)} P(mpij,1, mpij,2, mpij,3 | Cij) P(mpij,4, mpij,5, mpij,6 | Cij) P(mpij,7 | Cij)

Model Parameters

  P(mp1, mp2, mp3 | c)
  P(mp4, mp5, mp6 | c)
  P(mp7 | c)

- the mpi are the feature values
- c ∈ { Coref, Not Coref }

If we had labeled data, we could estimate the parameters. But we don't have labeled data. So ...

Model Parameters
Use EM to iteratively
  estimate the model parameters
  probabilistically induce a clustering for a document

120

The Induction Algorithm
Given a set of unlabeled documents,
  guess a clustering for each document according to P(C)
    (the initial labelings are presumably noisy)
  estimate the model parameters based on the automatically labeled documents (M-step): maximum likelihood estimation
  assign a probability to each possible clustering of the mentions for each document (E-step)
  iterate till convergence

E-step example with 3 mentions 1, 2, 3:

  [123]  [12][3]  [13][2]  [1][23]  [1][2][3]  + invalid clusterings
  0.23   0.21     0.11     0.29     0.05       …

How to cope with the computational complexity of the E-step?
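The exact E-step above can be sketched by enumerating every partition of the mentions and normalizing their scores; the uniform scoring function below is a placeholder for the model's P(D, C).

```python
def partitions(items):
    """Yield every partition (clustering) of a list of mentions."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # put `first` into each existing cluster ...
        for i, cluster in enumerate(smaller):
            yield smaller[:i] + [[first] + cluster] + smaller[i + 1:]
        # ... or into a new singleton cluster
        yield [[first]] + smaller

def e_step(mentions, score):
    """Brute-force E-step: score every clustering and normalize."""
    cs = list(partitions(mentions))
    raw = [score(c) for c in cs]
    z = sum(raw)
    return [(c, r / z) for c, r in zip(cs, raw)]

clusterings = list(partitions([1, 2, 3]))
# Bell number B(3) = 5: [123], [12][3], [13][2], [1][23], [1][2][3]
```

The number of partitions grows as the Bell numbers, which is exactly why the talk turns to approximating the E-step next.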

130

Approximating the E-step
Search for the N most probable clusterings only, using the Bell Tree algorithm

[Bell tree over 3 mentions: [1] → { [12], [1][2] } → { [123], [12][3], [13][2], [1][23], [1][2][3] }]

The Induction Algorithm (revised)
Given a set of unlabeled documents,
  guess a clustering for each document according to P(C)
  estimate the model parameters based on the automatically labeled documents (M-step): maximum likelihood estimation
  assign a probability to each possible clustering of the mentions of each document (E-step): use the normalized scores of the 50-best clusterings
  iterate till convergence

133

Plan for the Talk
Supervised learning for coreference resolution
  brief history
  standard machine learning approach
Unsupervised learning for coreference resolution
  self-training
  EM clustering (Ng, 2008)
  nonparametric Bayesian modeling (Haghighi and Klein, 2007)
    three modifications

134

Haghighi and Klein’s Model
Cluster-level model
  assigns a cluster id to each mention
  ensures transitivity automatically

[Queen Elizabeth]1 set about transforming [her]1 [husband]2,
[King George VI]2, into [a viable monarch]3. [Logue]4, [a
renowned speech therapist]4, was summoned to help [the
King]2 overcome [his]2 [speech impediment]5...

141

Haghighi and Klein’s Generative Story
For each mention encountered in a document,
  generate a cluster id for the mention (according to some cluster id distribution)
  generate the head noun of the mention (according to some cluster-specific head distribution)

Inference: Gibbs sampling

Problem with the model: too simplistic!
  mentions with the same head are likely to get the same cluster id
  two occurrences of “she” will likely be posited as coreferent
  particularly inappropriate for generating pronouns

Extensions:
  use a separate “pronoun head model” to generate pronouns
  incorporate salience
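The cluster-id-then-head story can be sketched as below. The Chinese-restaurant-process-style prior and add-one head smoothing are simplifying assumptions for illustration, not H&K's exact distributions; the example also shows the "same head, same cluster" tendency the slide criticizes.

```python
from collections import Counter

def assignment_scores(head, cluster_heads, alpha=1.0, vocab_size=1000):
    """Score each existing cluster (plus a brand-new one) for a mention with
    the given head: P(cluster id) * P(head | cluster). The CRP-style prior
    and smoothed head model are simplifying assumptions."""
    n = sum(sum(h.values()) for h in cluster_heads)
    scores = []
    for heads in cluster_heads:
        prior = sum(heads.values()) / (n + alpha)     # bigger clusters preferred
        likelihood = (heads[head] + 1.0) / (sum(heads.values()) + vocab_size)
        scores.append(prior * likelihood)
    # starting a new cluster
    scores.append((alpha / (n + alpha)) * (1.0 / vocab_size))
    return scores

# Cluster 0 has generated "Elizabeth" twice; cluster 1 generated "Logue" once.
clusters = [Counter({"Elizabeth": 2}), Counter({"Logue": 1})]
s = assignment_scores("Elizabeth", clusters)
```

A new mention headed "Elizabeth" is pulled strongly into cluster 0, which is exactly why two occurrences of the same pronoun head would also be posited as coreferent.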

147

Plan for the Talk
Supervised learning for coreference resolution
  brief history
  standard machine learning approach
Unsupervised learning for coreference resolution
  self-training and its variant (Ng and Cardie, 2003)
  EM clustering (Ng, 2008)
  nonparametric Bayesian modeling (Haghighi and Klein, 2007)
    three modifications: relaxed head generation, agreement constraints, pronoun-only salience

148

Modification 1: Relaxed Head Generation
Motivation
  H&K’s model is linguistically impoverished
  does not exploit useful knowledge: alias, appositives, …
Goal
  a simple method for incorporating such knowledge sources

150

Modification 1: Relaxed Head Generation
Pre-process a document by assigning a “head id” to each mention, such that two mentions have the same head id iff
  they are the same string,
  or they are aliases,
  or they are in an appositive relation

  International Business Corporation → head id 1
  IBM → head id 1
  Barcelona → head id 2

Instead of generating the head noun, generate the head id
  the model views “International Business Corporation” and “IBM” as two mentions having the same head
  this encourages the model to put the two into the same cluster
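The head-id pre-processing step amounts to computing connected components over the three relations; a minimal union-find sketch, with stub predicates standing in for the real string-match, alias, and apposition detectors:

```python
def assign_head_ids(mentions, same_string, alias, appositive):
    """Group mentions with union-find; mentions linked by string match,
    alias, or apposition share a head id. The relation predicates are
    assumed to be supplied by pre-processing."""
    parent = list(range(len(mentions)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i
    def union(i, j):
        parent[find(i)] = find(j)
    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            a, b = mentions[i], mentions[j]
            if same_string(a, b) or alias(a, b) or appositive(a, b):
                union(i, j)
    ids, head_id = {}, []
    for i in range(len(mentions)):
        root = find(i)
        ids.setdefault(root, len(ids) + 1)
        head_id.append(ids[root])
    return head_id

mentions = ["International Business Corporation", "IBM", "Barcelona"]
same = lambda a, b: a == b
# stub alias predicate hard-wired to the slide's example pair
alias_pairs = {frozenset({"International Business Corporation", "IBM"})}
alias = lambda a, b: frozenset({a, b}) in alias_pairs
appos = lambda a, b: False
ids = assign_head_ids(mentions, same, alias, appos)
```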

155

Modification 2: Agreement Constraints
Motivation
  gender and number agreement is implemented as a preference, not as a constraint, in H&K’s model
  while the model favours the assignment of a pronoun to a gender- and number-compatible cluster,
  it also favours the assignment of a pronoun to a large cluster
  if a cluster is large enough, the model may assign the pronoun to the cluster even if the two are not compatible
Goal
  implement gender and number agreement as a constraint:
  disallow the generation of a mention by any cluster where the two are incompatible in number or gender
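Turning the preference into a hard constraint can be sketched as zeroing the score of incompatible clusters before assignment; the mention representation and size-based score here are illustrative assumptions.

```python
def masked_scores(pronoun, clusters, score):
    """Zero out the score of any cluster that disagrees with the pronoun in
    gender or number, so the pronoun can never be generated by an
    incompatible cluster (the hard-constraint version of agreement)."""
    out = []
    for cluster in clusters:
        compatible = all(
            m["gender"] in (pronoun["gender"], "unknown") and
            m["number"] in (pronoun["number"], "unknown")
            for m in cluster)
        out.append(score(cluster) if compatible else 0.0)
    return out

she = {"gender": "fem", "number": "sg"}
clusters = [
    [{"gender": "fem", "number": "sg"}],        # small but compatible
    [{"gender": "masc", "number": "sg"}] * 5,   # large but masculine
]
# toy score: favour large clusters, as H&K's model does
scores = masked_scores(she, clusters, lambda c: float(len(c)))
```

Without the mask, the size preference would pull “she” into the large masculine cluster; with it, that assignment is impossible.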

160

Modification 3: Pronoun-Only Salience
In H&K’s model, salience is applied to all types of mentions (pronouns, names, and nominals) during cluster assignment

Our hypothesis
  since names and nominals are less sensitive to salience, the net benefit of applying salience to names and nominals could be negative, as a result of inaccurate modeling of salience

We restrict the application of salience to pronouns only

161

Improving Haghighi and Klein’s Model
3 modifications
  relaxed head generation
  agreement constraints
  pronoun-only salience

162

Evaluation
  EM-based model
  Haghighi and Klein’s model, with and without the 3 modifications

163

Experimental Setup
The ACE 2003 coreference corpus
  3 data sets (Broadcast News, Newswire, Newspaper)
For each data set
  use one training text for initializing model parameters
  evaluate on the entire test set
Mentions extracted automatically using an NP chunker
Scoring program
  CEAF scoring program (Luo, 2005)

164

Results (Weakly Supervised Baseline)

Experiments on System Mentions (CEAF):

                                 Broadcast News       Newswire
                                 R     P     F        R     P     F
  Weakly Supervised Baseline     53.1  45.5  49.0     57.2  50.3  53.5
  Heuristic Baseline             54.3  43.7  48.4     58.9  50.2  54.2
  Our EM-based Model             57.0  54.6  55.7     62.9  56.5  59.6
  Duplicated Haghighi and Klein  53.2  39.3  45.2     54.5  44.2  48.8
  + Relaxed Head Generation      53.4  42.8  47.5     55.9  49.8  52.6
  + Agreement Constraints        57.8  46.3  51.4     57.9  51.5  54.5
  + Pronoun-only Salience        59.2  50.8  54.7     59.4  55.6  57.4
  Fully Supervised Model         63.4  60.3  61.8     65.8  63.2  64.5

Weakly Supervised Baseline:
  train the Bayes classifier on one (labeled) document
  use the Bell Tree clustering algorithm to impose a partition for each test document using the pairwise probabilities

165

Heuristic Baseline
Simple rule-based system
Posits two mentions as coreferent if and only if they are
  the same string,
  aliases,
  or in an appositive relation

166

Results (Heuristic Baseline)
(results table as above)

167

EM-Based Model
Initialize the parameters using one (labeled) document, rather than using randomly guessed clusterings

168

Results (EM-Based Model)
(results table as above)
  gains in both recall and precision
  F-measure increases by 5-7%

170

Duplicated Haghighi and Klein’s Model
Use the same labeled document as in the EM-based model to learn the value of α in the Dirichlet Process

171

Results (Duplicated H&K’s Model)
(results table as above)
In comparison to the EM-based model:
  precision drops substantially
  F-measure decreases by 10-11%

173

Results (Adding 3 Modifications)
(results table as above)
In comparison to Duplicated Haghighi and Klein:
  F-measure improves after the addition of each modification
  modest gain in recall and substantial gain in precision when all modifications are applied (9-10% gain in F-measure)

176

Results (Fully-Supervised Resolver)
(results table as above)
  trained using C4.5, the entire ACE training set, and 34 features
  outperforms the unsupervised models by 7%

177

Using a Knowledge-Based Feature
Add a feature to the EM-based model that encodes the output of a knowledge-based coreference system
  implements heuristics used by different MUC-7 resolvers
The resulting model is not so “unsupervised”

178

Results (EM-Based Model w/ KB Feature)

Experiments on System Mentions (CEAF):

                                   Broadcast News       Newswire
                                   R     P     F        R     P     F
  EM-based Model (w/ KB feature)   65.4  53.3  58.8     68.1  58.2  62.8
  EM-based Model (w/o KB feature)  57.0  54.6  55.7     62.9  56.5  59.6
  Fully Supervised Model           63.4  60.3  61.8     65.8  63.2  64.5

179

Summary
Examined unsupervised models for coreference resolution
  self-training, EM, Haghighi and Klein’s model
  require little labeled data, which facilitates their application to resource-scarce languages
The EM-based model and the modified H&K model outperform self-training and H&K’s original model
Not as competitive as the fully-supervised model, but …

180

Summary (Cont’)
… they can potentially be improved by
  incorporating additional linguistic features
    feature engineering remains a challenging issue
  combining a large amount of labeled data with a large amount of unlabeled data

generative modeling is interesting in itself

181

Summary
Examined unsupervised models for coreference resolution
  self-training, EM, Haghighi and Klein’s model
  require little labeled data, which facilitates their application to resource-scarce languages

Self-training, with and without bagging
  doesn’t improve (and sometimes even hurts) performance
  augmenting the labeled data with only confidently-labeled instances means little knowledge is gained by the classifier
  careful feature design is an especially important issue
  need to label both confident and not-so-confident instances

182

Summary (Cont’)
EM-based generative model
  induces a clustering on an unlabeled document
  outperforms Haghighi and Klein’s coreference model
Three extensions to Haghighi and Klein’s generative model
  each modification improves F-measure
Not as competitive as the fully-supervised model, but …
  generative modeling is interesting in itself
  feature engineering remains a crucial yet challenging issue

183

Weakly Supervised Baseline
Train the Naïve Bayes classifier on one (labeled) document
Use the Bell Tree clustering algorithm to impose a partition on each test document using the pairwise probabilities

184

Experimental Setup
The ACE 2003 coreference corpus
  3 data sets (Broadcast News, Newswire, Newspaper)
  each has a training set and a test set
  use one training text for training the Bayes coreference classifier
  evaluate on the entire test set
Mentions extracted automatically using an NP chunker
Scoring program
  MUC scoring program (Vilain et al., 1995)?
  2 problems:
    under-penalizes partitions where mentions are over-clustered
    does not reward successful identification of singleton clusters

186

The Bayes Classifier
finds the class value y (COREF or NOT COREF) that is the most probable given the feature vector x_1, …, x_n

finds y* such that

  y* = argmax_{y ∈ Y} P(y | x_1, x_2, …, x_n)
     = argmax_{y ∈ Y} P(y) P(x_1, x_2, …, x_7 | y)
     = argmax_{y ∈ Y} P(y) P(x_1, x_2, x_3 | y) P(x_4, x_5, x_6 | y) P(x_7 | y)

These are the model parameters (to be estimated from annotated data using maximum likelihood estimation)

Not as naïve as Naïve Bayes …
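The "not so naive" decision rule, with features jointly modeled within each group but groups independent given y, can be sketched as follows; the parameter tables are made-up illustrations, not estimates from the ACE data.

```python
from math import prod  # Python 3.8+

# Toy parameter tables: 7 binary-ish features in 3 groups, modeled jointly
# within a group. All numbers are illustrative.
PRIOR = {True: 0.3, False: 0.7}                # P(y), y = True for COREF
GROUPS = [(0, 1, 2), (3, 4, 5), (6,)]          # indices of the 3 groups
TABLES = [
    {(True, (1, 0, 0)): 0.4, (False, (1, 0, 0)): 0.05,
     (True, (0, 0, 0)): 0.6, (False, (0, 0, 0)): 0.95},
    {(True, (1, 1, 1)): 0.9, (False, (1, 1, 1)): 0.5,
     (True, (0, 1, 1)): 0.1, (False, (0, 1, 1)): 0.5},
    {(True, (1,)): 0.7, (False, (1,)): 0.3,
     (True, (0,)): 0.3, (False, (0,)): 0.7},
]

def classify(x):
    """Return True iff P(y)P(x1..x3|y)P(x4..x6|y)P(x7|y) is larger for COREF."""
    def score(y):
        return PRIOR[y] * prod(
            t[(y, tuple(x[i] for i in g))] for t, g in zip(TABLES, GROUPS))
    return score(True) > score(False)

decision = classify((1, 0, 0, 1, 1, 1, 1))   # e.g. string match + agreement
```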

187

Results (Self-Training w/ and w/o Bagging)

[Two plots, Broadcast News and Newswire: F-measure (y-axis, 37–55) vs. number of iterations (x-axis, 0–9), for self-training w/ bagging (5 bags) and w/o bagging]

188

Self-Training with Bagging

[Diagram: a small labeled data set (L), a large pool of unlabeled data (U), and bagged classifiers h1, h2, …, hk]

Create k training sets, each of size |L|, by sampling from L with replacement
Train k classifiers (bagged classifiers h1, h2, …, hk)
Have each classifier label the unlabeled data
Add to L the N labeled instances with the highest average confidence
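The loop above can be sketched generically; the `train`/`predict` interfaces and the toy sign-based learner below are illustrative assumptions, not the coreference classifier itself.

```python
import random

def self_train_with_bagging(L, U, train, predict, k=5, n=10, iters=3):
    """Self-training with bagging (sketch). Assumed interfaces:
    train(data) -> model, predict(model, x) -> (label, confidence).
    Each round: bag k classifiers from L, label U, and move the n
    instances with the highest average confidence into L."""
    L, U = list(L), list(U)
    for _ in range(iters):
        if not U:
            break
        models = [train(random.choices(L, k=len(L))) for _ in range(k)]
        scored = []
        for x in U:
            preds = [predict(m, x) for m in models]
            label = max(set(p[0] for p in preds),
                        key=lambda l: sum(1 for p in preds if p[0] == l))
            conf = sum(p[1] for p in preds) / k     # average confidence
            scored.append((conf, x, label))
        scored.sort(reverse=True, key=lambda t: t[0])
        for conf, x, label in scored[:n]:
            L.append((x, label))
            U.remove(x)
    return L

random.seed(0)
# toy learner: ignores its training data and labels by sign of x
train = lambda data: None
predict = lambda model, x: (x > 0, min(abs(x) / 10, 1.0))
L0 = [(x, x > 0) for x in (-2.0, -1.0, 1.0, 2.0)]
U0 = [-5.0, 5.0, 0.1]
L1 = self_train_with_bagging(L0, U0, train, predict, k=3, n=2, iters=1)
```

Note how the instances moved into L are exactly the ones the classifier was already most sure about, which is the failure mode the next slide diagnoses.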

193

Why doesn’t Self-Training improve?
  only the most confidently labeled instances are added in each iteration
  the classifier already knows how to label these newly added instances
  not much new knowledge is gained by re-training a classifier on such newly added instances

Need to learn from both the confidently and not-so-confidently labeled instances

194

Haghighi and Klein’s Model
Nonparametric Bayesian model
  enables the use of prior knowledge to put a higher probability on hypotheses deemed more likely
  doesn’t commit to a particular set of parameters (doesn’t attempt to compute the most likely hypothesis)

Given a set of mentions X, find the most likely partition Z, i.e., the Z that maximizes

  P(Z | X) = ∫ P(Z | X, θ) P(θ | X) dθ

  integrate out the parameters θ
  the prior encodes knowledge on hypotheses

199

Bell-Tree Clustering (Luo et al., 2004)
  searches for the most probable partition of a set of mentions
  structures the search space as a Bell tree
  expands only the most promising paths

[Bell tree over 3 mentions: [1] → { [12], [1][2] } → { [123], [12][3], [13][2], [1][23], [1][2][3] }]

How to determine which paths are promising?

202

Determining the Most Promising Paths
Idea: assign a score to each node, based on the pairwise probabilities returned by the coreference classifier

The classifier gives us: Pc(1, 2) = 0.6, Pc(1, 3) = 0.2, Pc(2, 3) = 0.7

  [1] (score 1) expands to [12] (score 0.6) and [1][2] (score 0.4)
  [12] expands to [123] with score 0.6 · max(Pc(1,3), Pc(2,3)) = 0.6 · 0.7 = 0.42,
  and to [12][3] with score 0.6 · (1 - max(Pc(1,3), Pc(2,3))) = 0.6 · 0.3 = 0.18
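One expansion step of this node-scoring heuristic can be sketched as below; linking a cluster is scored by its best pairwise probability with the new mention, and starting a new cluster by one minus the best link overall, following the formula on the slide (note 0.6 · (1 - 0.7) works out to 0.18).

```python
def expand(partition, score, new_mention, p_link):
    """Expand one Bell-tree node: the new mention either joins an existing
    cluster or starts its own. A cluster's link score is the max pairwise
    probability p_link(m, new) over its mentions; starting a new cluster
    scores 1 - max over all clusters."""
    children = []
    link_scores = [max(p_link(m, new_mention) for m in c) for c in partition]
    for i, ls in enumerate(link_scores):
        child = [list(c) for c in partition]
        child[i].append(new_mention)
        children.append((child, score * ls))
    children.append(([list(c) for c in partition] + [[new_mention]],
                     score * (1 - max(link_scores))))
    return children

PC = {(1, 2): 0.6, (1, 3): 0.2, (2, 3): 0.7}
p_link = lambda a, b: PC[tuple(sorted((a, b)))]
kids = expand([[1, 2]], 0.6, 3, p_link)   # expand node [12], score 0.6
```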

204

Plan for the Talk
Supervised learning for coreference resolution
  brief history
  standard machine learning approach
Unsupervised learning for coreference resolution
  self-training and its variant (Ng and Cardie, 2003)
  EM clustering (Ng, 2008)
  nonparametric Bayesian modeling (Haghighi and Klein, 2007)
    three modifications

205

Standard Supervised Learning Approach
Classification
  given a description of two mentions, mi and mj, classify the pair as coreferent or not coreferent
  create one training instance for each pair of mentions from texts annotated with coreference information
    feature vector: describes the two mentions
  train a classifier using a machine learning algorithm
    decision tree learner (C5), maximum entropy, SVMs

  [Queen Elizabeth] set about transforming [her] [husband], ...
      coref? / not coref? / coref?
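Instance creation for this pairwise scheme can be sketched as follows; the featurizer is a placeholder for the real feature extractor.

```python
def make_instances(mentions, chains, featurize):
    """Create one training instance per mention pair, labeled coreferent
    iff the two mentions belong to the same annotated chain. `featurize`
    builds the feature vector describing the pair."""
    chain_of = {m: i for i, chain in enumerate(chains) for m in chain}
    instances = []
    for j in range(1, len(mentions)):
        for i in range(j):
            mi, mj = mentions[i], mentions[j]
            label = chain_of.get(mi) == chain_of.get(mj) and mi in chain_of
            instances.append((featurize(mi, mj), label))
    return instances

mentions = ["Queen Elizabeth", "her", "husband"]
chains = [["Queen Elizabeth", "her"]]   # "husband" is unchained here
data = make_instances(mentions, chains, lambda a, b: (a, b))
```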

206

Related Work
Apply a weakly supervised or unsupervised learning algorithm to pronoun resolution
  co-training (Müller et al., 2002)
  self-training (Kehler et al., 2004)

207

Linguistic Features
Use 7 linguistic features, computed by heuristics and divided into 3 groups

Strong Coreference Indicators
  String match
  Appositive
  Alias (one is an acronym or abbreviation of the other)

Linguistic Constraints
  Gender agreement
  Number agreement
  Semantic compatibility

Mention Type Pairs
  (ti, tj), where ti, tj ∈ { Pronoun, Name, Nominal }

How to compute the semantic class of a mention?
  Proper names: use a named entity recognizer
  Nominals: induced from an unannotated corpus

210

Inducing Semantic Classes
Goal: induce the semantic class of a nominal, focusing on PERSON, ORGANIZATION, LOCATION, and OTHERS

Given a large, unannotated corpus:
  use a parser to extract appositive relations
    <Eastern Airlines, carrier>, <George Bush, president>, …
  use a named entity recognizer to find the semantic classes of the proper names
  infer the semantic class of a nominal from the associated proper name

212

Potential Problems
The named entity recognizer is not perfect
  mislabels proper names
The parser is not perfect
  extracts mention pairs that are not in apposition

To improve robustness:
  1. Compute the probability that the nominal co-occurs with each of the named entity types
  2. If the most likely NE type has a probability above 0.7, label the nominal with the most likely NE type
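The two robustness steps can be sketched as below; the appositive pairs and NE lexicon are toy stand-ins for parser and NE-recognizer output.

```python
from collections import Counter, defaultdict

def induce_semantic_classes(appositive_pairs, ne_type, threshold=0.7):
    """For each nominal, count the NE types of the proper names it appears
    in apposition with; keep the majority type only if its relative
    frequency clears the threshold, otherwise fall back to OTHERS."""
    counts = defaultdict(Counter)
    for name, nominal in appositive_pairs:
        counts[nominal][ne_type(name)] += 1
    labels = {}
    for nominal, c in counts.items():
        best, n = c.most_common(1)[0]
        labels[nominal] = best if n / sum(c.values()) > threshold else "OTHERS"
    return labels

# Toy NE lexicon and parser output (illustrative, including one parser error
# that puts "Paris" in apposition with "president").
NE = {"Eastern Airlines": "ORGANIZATION", "George Bush": "PERSON",
      "Delta": "ORGANIZATION", "Paris": "LOCATION"}
pairs = [("Eastern Airlines", "carrier"), ("Delta", "carrier"),
         ("George Bush", "president"), ("Paris", "president")]
labels = induce_semantic_classes(pairs, NE.get)
```

The noisy "president" evidence falls below the 0.7 threshold, so that nominal is left as OTHERS rather than mislabeled.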

214


Experiments on System Mentions: MUC and CEAF F-Scores

                                 Broadcast News     Newswire
                                 MUC     CEAF       MUC     CEAF
Weakly Supervised Baseline       38.0    49.0       42.8    53.5
Heuristic Baseline               36.4    48.4       43.2    54.2
Our EM-based Model               51.6    55.7       57.8    59.6
Duplicated Haghighi and Klein    45.2    45.2       41.9    48.8
+ Relaxed Head Generation        47.0    47.5       45.0    52.6
+ Agreement Constraints          48.9    51.4       46.0    54.5
+ Pronoun-only Salience          52.6    54.7       50.0    57.4
Fully Supervised Model           60.4    61.8       60.6    64.5

Similar performance trends across the 2 scoring programs

218

Experiments using Perfect Mentions

Perfect mentions are NPs marked up in the answer key; using them makes the coreference task somewhat easier

Similar performance trends observed, except that the unsupervised models perform comparably to the fully-supervised resolver

Conclusions drawn from system mentions are not always generalizable to perfect mentions and vice versa

219

Summary

Presented an EM-based model for unsupervised coreference resolution that:
outperforms Haghighi and Klein’s coreference model
compares favourably to a modified version of their model

220

H&K’s Model: Salience Modeling

Each entity/cluster is initially assigned a salience value of 0
As we process the discourse, the salience value of each entity will change
When we encounter a mention, we update the salience scores (multiply each entity’s score by 0.5, then add 1 to the current entity)
Then discretize the salience values into 5 buckets: TOP, HIGH, MID, LOW, NONE
Using a separate corpus, estimate P(mention type | salience), where mention type can be pronoun, name, or nominal. E.g.,
P(pronoun | TOP) is a large value
P(nominal | TOP) is a small value
The model is sensitive to these estimated values

221
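A minimal sketch of the salience update; the bucket cutoffs are illustrative assumptions, since the slide does not give them:

```python
def update_salience(salience, current_entity, decay=0.5):
    """One salience update from H&K's scheme: decay every entity's
    score by 0.5, then add 1 to the entity of the current mention."""
    for eid in salience:
        salience[eid] *= decay
    salience[current_entity] = salience.get(current_entity, 0.0) + 1.0
    return salience

def bucket(score):
    # Bucket boundaries are NOT given on the slide; these cutoffs
    # are illustrative assumptions only.
    if score >= 1.0:
        return "TOP"
    if score >= 0.5:
        return "HIGH"
    if score >= 0.25:
        return "MID"
    if score > 0.0:
        return "LOW"
    return "NONE"

s = {}
update_salience(s, "e1")   # e1: 1.0
update_salience(s, "e2")   # e1: 0.5, e2: 1.0
print({e: bucket(v) for e, v in s.items()})  # {'e1': 'HIGH', 'e2': 'TOP'}
```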

Why Salience Modeling?

Important for pronouns

For H&K, since they don’t use features like apposition, modeling salience may allow mentions in an appositive to be assigned the same cluster id.

222

Parameter Initialization

= 0.4 (true mentions) and 0.7 (system mentions); concentration parameter: e^-4

223

Parameter Initialization

Uses one (labeled) document taken from the training set to:
initialize the parameters of our EM-based model
determine the concentration parameter, α, in H&K’s model

224

Experiments with Perfect Mentions

Similar performance trends observed, except that the unsupervised models perform comparably to the fully-supervised resolver

Conclusions drawn from perfect mentions are not always generalizable to system mentions and vice versa

Results obtained using perfect mentions should not be compared against those obtained using system mentions

225

Degenerate EM Baseline

Model obtained after one iteration of EM

No parameter re-estimation on the unlabeled data

226


Degenerate EM Baseline: MUC Results (Experiments on System Mentions)

                               Broadcast News        Newswire
                               R     P     F         R     P     F
Heuristic Baseline             30.9  44.3  36.4      36.3  53.4  43.2
Degenerate EM Baseline         70.8  36.3  48.0      69.0  25.1  36.8
Our EM-based Model             42.4  66.0  51.6      55.2  60.6  57.8
Haghighi and Klein Baseline    50.8  40.7  45.2      43.0  40.9  41.9
+ Relaxed Head Generation      48.3  45.7  47.0      40.9  50.0  45.0
+ Agreement Constraints        50.4  47.5  48.9      41.7  51.2  46.0
+ Pronoun-only Salience        52.2  53.0  52.6      44.3  57.3  50.0
Fully Supervised Model         53.0  70.3  60.4      53.1  70.5  60.6

Large gain in recall and large drop in precision (over-clustering)
F-score increases for one data set and drops for the other

228

EM-Based Model: MUC Results

In comparison to Degenerate EM:
large drop in recall, but larger gain in precision
F-score increases by 4-21%
gains attributed to the exploitation of unlabeled data

                               Broadcast News        Newswire
                               R     P     F         R     P     F
Heuristic Baseline             30.9  44.3  36.4      36.3  53.4  43.2
Our EM-based Model             42.4  66.0  51.6      55.2  60.6  57.8
Haghighi and Klein Baseline    50.8  40.7  45.2      43.0  40.9  41.9
+ Relaxed Head Generation      48.3  45.7  47.0      40.9  50.0  45.0
+ Agreement Constraints        50.4  47.5  48.9      41.7  51.2  46.0
+ Pronoun-only Salience        52.2  53.0  52.6      44.3  57.3  50.0
Fully Supervised Model         53.0  70.3  60.4      53.1  70.5  60.6

229

Experiments on System Mentions: MUC, CEAF, CEAF-Variant F-Scores

                               Broadcast News            Newswire
                               MUC   CEAF  CEAFV         MUC   CEAF  CEAFV
Heuristic Baseline             36.4  48.4  46.3          43.2  54.2  50.3
Degenerate EM Baseline         48.0  39.4  35.8          36.8  27.9  26.3
Our EM-based Model             51.6  55.7  52.9          57.8  59.6  52.8
Haghighi and Klein Baseline    45.2  45.2  39.0          41.9  48.8  41.7
+ Relaxed Head Generation      47.0  47.5  42.3          45.0  52.6  46.3
+ Agreement Constraints        48.9  51.4  47.0          46.0  54.5  48.4
+ Pronoun-only Salience        52.6  54.7  51.1          50.0  57.4  51.2
Fully Supervised Model         60.4  61.8  59.9          60.6  64.5  60.6

230

Observations (slides 231-235 each repeat the table above with one takeaway added):
Degenerate EM Baseline performs the worst
EM-based Model outperforms Heuristic Baseline
Addition of each extension yields improvements in F-score
Extended H&K system performs comparably with the EM-based model
Unsupervised models lag the performance of the supervised model

235

Unsupervised Coreference as EM Clustering

Design a generative model that can be used to induce a clustering of the mentions in a given document

Exploit pairwise linguistic constraints: gender and number agreement, semantic compatibility, …

236

Representing a Clustering

A clustering C of n mentions is an n x n Boolean matrix, where Cij = 1 iff mentions i and j are coreferent

Facilitates the incorporation of pairwise linguistic constraints

[figure: two 5 x 5 Boolean matrices, one showing a valid clustering and one an invalid clustering]

237
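A valid matrix must encode an equivalence relation over the mentions. A minimal sketch of the validity check (names and representation are my own):

```python
def is_valid_clustering(C):
    """Check that a Boolean coreference matrix encodes a valid
    clustering, i.e. an equivalence relation over the mentions:
    reflexive, symmetric, and transitive."""
    n = len(C)
    for i in range(n):
        if not C[i][i]:                      # reflexivity
            return False
        for j in range(n):
            if C[i][j] != C[j][i]:           # symmetry
                return False
            for k in range(n):
                if C[i][j] and C[j][k] and not C[i][k]:
                    return False             # transitivity
    return True

# mentions 1 and 2 coreferent, mention 3 alone: valid
valid = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
# 1~2 and 2~3 but not 1~3: violates transitivity
invalid = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
print(is_valid_clustering(valid), is_valid_clustering(invalid))  # True False
```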

Strong Coreference Indicators

String match Alias (one is an acronym or abbreviation of the other) Appositive

Linguistic Constraints

Gender agreement Number agreement Semantic compatibility

Mention Type Pairs (ti, tj), where ti, tj ∈ { Pronoun, Proper, Common }

Use 7 linguistic features

Features

238


Computing the E-step

Goal: assign a probability to each possible clustering of the mentions in a document

Computationally intractable: number of clusterings is exponential in the number of mentions

Search for the N most probable clusterings only, using Luo et al.’s (2004) search algorithm, which structures the search space as a Bell tree

242

A Bell Tree

[figure: the Bell tree for 3 mentions — root [1]; children [12] and [1][2]; leaves [123], [12][3], [13][2], [1][23], [1][2][3]]

243

The Bell-Tree Search Algorithm

Finds the N most probable paths from the root to a leaf using a beam search

The probability of a clustering (or partition) is the probability assigned to the corresponding path

244
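A hypothetical sketch of the beam search over the Bell tree: each partial partition is extended by either joining the new mention to an existing cluster or starting a new one, keeping only the N best. The link scoring (min/max of assumed pairwise probabilities) is a simplification of Luo et al.'s actual model:

```python
import heapq

def beam_search_partitions(n_mentions, pair_prob, beam_size=5):
    """Beam search over the Bell tree. pair_prob(i, j) is an assumed
    pairwise coreference probability; a clustering's score multiplies
    the probabilities of its linking decisions (a simplification)."""
    beam = [(1.0, [[0]])]                        # (score, partition of mention 0)
    for m in range(1, n_mentions):
        candidates = []
        for score, part in beam:
            for c, cluster in enumerate(part):   # join existing cluster c
                p = min(pair_prob(i, m) for i in cluster)
                new = [list(cl) for cl in part]
                new[c].append(m)
                candidates.append((score * p, new))
            # start a new cluster: m links to nothing so far
            p_new = 1.0
            for cl in part:
                p_new *= 1 - max(pair_prob(i, m) for i in cl)
            candidates.append((score * p_new, part + [[m]]))
        beam = heapq.nlargest(beam_size, candidates, key=lambda x: x[0])
    return beam

# toy probabilities: only mentions 0 and 1 look coreferent
prob = lambda i, j: 0.8 if (i, j) == (0, 1) else 0.1
best = beam_search_partitions(3, prob, beam_size=3)
print(best[0][1])   # [[0, 1], [2]]
```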

Degenerate EM Baseline

Model that is obtained after one iteration of EM:
initializes model parameters based on a labeled document
applies the model (and Bell tree search) to obtain the most probable coreference partition

no parameter re-estimation on the unlabeled data

245

Noun Phrase Coreference

Identify the noun phrases (or mentions) that refer to the same real-world entity

Partition the set of mentions into coreference equivalence classes

Queen Elizabeth set about transforming her husband, King George VI, into a viable monarch. A renowned speech therapist was summoned to help the King overcome his speech impediment...

246

Supervised Coreference Resolution

Lots of prior work on supervised coreference resolution: Soon et al. (2001), Strube et al. (2002), Yang et al. (2003), Luo et al. (2004), Denis and Baldridge (2007), …

247

Representing a Clustering

A clustering C of n mentions is an n x n Boolean matrix, where Cij = 1 iff mentions i and j are coreferent

Reflexivity: Cii = 1 for every mention i

[figure: 5 x 5 matrix with the diagonal cells highlighted]

248

Approximating the E-step

Search for the N most probable clusterings only, using Luo et al.’s (2004) search algorithm:
structures the search space as a Bell tree
takes as input the pairwise coreference probabilities
scores a clustering based on these probabilities

249

Haghighi and Klein’s Model

Cluster-level model:
assigns a cluster id to each mention
ensures transitivity automatically

Nonparametric Bayesian model:
does not commit to a particular set of parameters

250

Model Parameters

P(mp1, mp2, mp3 | c)
P(mp4, mp5, mp6 | c)
P(mp7 | c)

mpi are the feature values
c ∈ { COREF, NOT COREF }

251

Experimental Setup

The ACE 2003 coreference corpus:
3 data sets (Broadcast News, Newswire, Newspaper)
each has a training set and a test set; evaluate on the test set only

Mentions:
system mentions (mentions extracted by an NP chunker)
perfect mentions (mentions extracted from the answer key)

Scoring programs (recall, precision, F-measure):
MUC scoring program (Vilain et al., 1995)
CEAF scoring program (Luo, 2005)
CEAF variant: same as CEAF, but ignores singleton clusters

252


Features

Use 7 linguistic features divided into 3 groups:

Strong Coreference Indicators: string match, appositive, alias (one is an acronym or abbreviation of the other)

Linguistic Constraints: gender agreement, number agreement, semantic compatibility

Mention Type Pairs (ti, tj), where ti, tj ∈ { Pronoun, Name, Nominal }

254


The Generative Model

Given a document D,
generate a clustering C according to P(C)
generate D given C

P(D, C) = P(C) P(D | C)
        = P(C) P(mp12, mp13, mp14, … | C)
        = P(C) ∏_{ij ∈ Pairs(D)} P(mp_ij | C_ij)
        = P(C) ∏_{ij ∈ Pairs(D)} P(mp_ij,1, mp_ij,2, …, mp_ij,7 | C_ij)
        = P(C) ∏_{ij ∈ Pairs(D)} P(mp_ij,1, mp_ij,2, mp_ij,3 | C_ij) P(mp_ij,4, mp_ij,5, mp_ij,6 | C_ij) P(mp_ij,7 | C_ij)

257
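The factored probability can be sketched in code; all names and the toy conditional probabilities below are assumptions for illustration, not the paper's implementation:

```python
from itertools import combinations

def joint_probability(clustering, features, p_c, cond):
    """Sketch of P(D, C) = P(C) * prod over mention pairs of
    P(mp_ij | C_ij), with the 7 features factored into the three
    conditionally independent groups from the slides.
    clustering[i][j] -> 1 if mentions i, j are coreferent
    features[(i, j)] -> tuple of 7 feature values for the pair
    p_c              -> P(C), the prior probability of the clustering
    cond[(group, c)] -> P(feature group | c), estimated elsewhere
    """
    n = len(clustering)
    total = p_c
    for i, j in combinations(range(n), 2):
        c = clustering[i][j]
        f = features[(i, j)]
        total *= (cond[(f[0:3], c)]      # strong coreference indicators
                  * cond[(f[3:6], c)]    # linguistic constraints
                  * cond[(f[6:7], c)])   # mention type pair
    return total

# toy example with two coreferent mentions and made-up probabilities
clustering = [[1, 1], [1, 1]]
features = {(0, 1): ("Y", "N", "N", "Y", "Y", "Y", ("Name", "Name"))}
cond = {(("Y", "N", "N"), 1): 0.4,
        (("Y", "Y", "Y"), 1): 0.5,
        ((("Name", "Name"),), 1): 0.6}
print(round(joint_probability(clustering, features, 0.5, cond), 4))  # 0.06
```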


The Induction Algorithm

Given a set of unlabeled documents:
guess a clustering for each document according to P(C)
estimate the model parameters based on the automatically labeled documents (M-step): maximum likelihood estimation
assign a probability to each possible clustering of the mentions in each document (E-step)

3 mentions: 1, 2, 3
[123]  [12][3]  [13][2]  [1][23]  [1][2][3]
0.23   0.32     0.11     0.29     0.05

Iterate till convergence

How to cope with the computational complexity of the E-step?

263
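The loop above can be sketched as a skeleton, with e_step and m_step as stand-ins for the model-specific computations described on the slides:

```python
def em_induction(documents, init_params, e_step, m_step, max_iters=10):
    """Skeleton of the induction loop: alternate the E-step (score the
    N best clusterings of each document under the current parameters)
    and the M-step (maximum likelihood re-estimation from those
    weighted clusterings), until the parameters stop changing."""
    params = init_params
    for _ in range(max_iters):
        # E-step: per document, the N most probable clusterings and
        # their probabilities (found with the Bell-tree beam search)
        posteriors = [e_step(doc, params) for doc in documents]
        # M-step: maximum likelihood re-estimation from the soft labels
        new_params = m_step(documents, posteriors)
        if new_params == params:   # converged
            break
        params = new_params
    return params

# trivial stand-ins just to exercise the loop
docs = ["doc1", "doc2"]
toy_e = lambda doc, p: {"[123]": 0.6, "[12][3]": 0.4}
toy_m = lambda ds, post: "theta*"
print(em_induction(docs, "theta0", toy_e, toy_m))   # theta*
```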

Goals

Design a new model for unsupervised coreference resolution

Improve Haghighi and Klein’s model with three modifications

264


Evaluation Results

Broadcast News: Recall: 53.1, Precision: 45.5, F-measure: 49.0
Newswire: Recall: 57.2, Precision: 50.3, F-measure: 53.5

Can we improve performance by combining labeled and unlabeled data?

266

Haghighi and Klein’s Generative Story

For each mention encountered in a document,
generate a cluster id for the mention (according to some cluster id distribution)
generate the head noun of the mention (according to some cluster-specific head distribution)

EM-based Generative Model:
Create mention pairs
For each pair, guess whether it is COREF or NOT COREF according to P(COREF)
Generate feature values

H&K’s Generative Model:
For each mention, guess the cluster id according to P(cluster id)
Generate feature values

267


Dirichlet Process

Generate new cluster ids as needed

Probability of generating some existing cluster id i:
n_i / (N - 1 + α), where n_i is the number of mentions already in cluster i
higher probability for larger clusters

Probability of generating some new cluster id:
α / (N - 1 + α), for some constant α

269
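The generation rule can be sketched as a Chinese-restaurant-process draw; the function name is hypothetical and the normalization by (N - 1 + α) is the standard CRP formulation assumed here:

```python
import random

def sample_cluster_id(cluster_sizes, alpha):
    """One draw from the Chinese restaurant process behind the slide:
    an existing cluster id i is chosen with probability n_i / (N - 1 + alpha)
    and a brand-new id with probability alpha / (N - 1 + alpha), where the
    n_i are the sizes of clusters built from the N - 1 previous mentions."""
    total_seen = sum(cluster_sizes.values())       # N - 1
    r = random.random() * (total_seen + alpha)
    for cid, size in cluster_sizes.items():
        if r < size:
            return cid                             # join existing cluster
        r -= size
    return max(cluster_sizes, default=0) + 1       # open a new cluster

random.seed(0)
sizes = {1: 2, 2: 2}     # four mentions seen so far, two clusters of size 2
draws = [sample_cluster_id(sizes, alpha=1.0) for _ in range(1000)]
# empirically, id 3 (a new cluster) appears about alpha/(4+alpha) = 20% of the time
print(round(draws.count(3) / 1000, 2))
```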


The CEAF Scoring Program

Input: correct partition, system partition
Correct partition: {3}, {4, 7}, {2, 5, 8}, {6}, {1, 9}
System partition: {6, 11, 12}, {2, 7, 8}, {1, 4, 9}, {3, 5, 10}

Recast the scoring problem as bipartite matching
Find the best matching using the Hungarian Algorithm

Best-matching cluster overlaps: 2, 2, 1, 1
Matching score = 6
Recall = 6 / 9 = 0.66
Prec = 6 / 12 = 0.5
F-measure = 0.57

274
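The example can be checked in code. A brute-force matcher stands in for the Hungarian Algorithm (equivalent at this size); only the partitions come from the slide:

```python
from itertools import permutations

def ceaf(gold, system):
    """CEAF on the slide's example: find the one-to-one matching between
    gold and system clusters maximizing total overlap (brute force over
    permutations here; Luo (2005) uses the Hungarian Algorithm), then
    score recall/precision against the mention totals."""
    big, small = (gold, system) if len(gold) >= len(system) else (system, gold)
    best = 0
    for perm in permutations(range(len(big)), len(small)):
        best = max(best, sum(len(small[i] & big[perm[i]])
                             for i in range(len(small))))
    recall = best / sum(len(c) for c in gold)
    precision = best / sum(len(c) for c in system)
    f = 2 * recall * precision / (recall + precision)
    return best, recall, precision, f

gold = [{3}, {4, 7}, {2, 5, 8}, {6}, {1, 9}]
system = [{6, 11, 12}, {2, 7, 8}, {1, 4, 9}, {3, 5, 10}]
score, r, p, f = ceaf(gold, system)
print(score, round(r, 2), round(p, 2), round(f, 2))   # 6 0.67 0.5 0.57
```

(The slide truncates 6/9 to 0.66; `round` gives 0.67.)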


Standard Supervised Learning Approach

Classification: given a description of two mentions, mi and mj, classify the pair as coreferent or not coreferent

Create one training instance for each pair of mentions from a training text; feature vector: describes the two mentions

[Queen Elizabeth] set about transforming [her] [husband], ...

coref ?

not coref ?

coref ?

276
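Instance creation can be sketched as below; the feature extractor is a placeholder, and the gold cluster ids are assumed given:

```python
from itertools import combinations

def make_training_instances(mentions, gold_cluster_of, extract_features):
    """Create one training instance per mention pair, labeled COREF
    when the two mentions share a gold cluster. extract_features is a
    stand-in for the feature vector describing the pair."""
    instances = []
    for mi, mj in combinations(mentions, 2):
        label = ("COREF" if gold_cluster_of[mi] == gold_cluster_of[mj]
                 else "NOT COREF")
        instances.append((extract_features(mi, mj), label))
    return instances

mentions = ["Queen Elizabeth", "her", "husband"]
gold = {"Queen Elizabeth": 1, "her": 1, "husband": 2}
feats = lambda a, b: (a, b)   # placeholder features
for f, y in make_training_instances(mentions, gold, feats):
    print(f, y)
# ('Queen Elizabeth', 'her') COREF
# ('Queen Elizabeth', 'husband') NOT COREF
# ('her', 'husband') NOT COREF
```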


Haghighi and Klein’s Generative Story

For each mention encountered in a document,
generate a cluster id for the mention (according to some cluster id distribution)
generate the head noun of the mention (according to some cluster-specific head distribution)

The probability of generating a particular cluster id is based on some distribution that specifies P(id=1), P(id=2), P(id=3), …
but we don’t know the number of clusters a priori
don’t know how many probabilities to specify
need a distribution over an unknown number of clusters

278


Dirichlet Process

Generate new cluster ids as needed

Queen Elizabeth set about transforming her husband,

King George VI, into a viable monarch. Logue, a

renowned speech therapist, was summoned to help the

King overcome his speech impediment...

(cluster ids assigned so far: Queen Elizabeth → 1, her → 1, husband → 2, King George VI → 2, Logue → ?)

Should we generate id 1 or 2, or should we generate a new id 3?

281


Dirichlet Process

Generate new cluster ids as needed

Probability of generating some existing cluster id i is proportional to the number of mentions already in cluster i
higher probability for larger clusters

Probability of generating some new cluster id is proportional to some constant α
