
Page 1: Web People Search

PhD Thesis Defense

Javier Artiles Picón
NLP & IR Group, UNED, Madrid

PhD supervisors:

Julio Gonzalo Arroyo

Enrique Amigó Cabrera

Web People Search

1

Page 2: Web People Search

Finding people on the Web…

Web person profiling: 80% of U.S. companies check the Web before hiring someone.

In 30% of cases, Web results impact the hiring decision (source: notoriety.com).

Popularity & reputation management.

Further natural language processing: biographical attribute extraction.

Summarization.

Simply, find out information about an individual.

I. Introduction

Page 3: Web People Search

Diving in… mixed results

1 - fitness guru

2 - photographer

3 - photographer

4 - photographer

5 - advertising Supervisor at Flamingo Las Vegas

6 - advertising Supervisor at Flamingo Las Vegas

7 - empty blog ?

8 - St. Louis, MO

9 - 55 years old LAS VEGAS, Nevada, Estados Unidos

10 - fitness guru

I. Introduction

Page 4: Web People Search

I. Introduction

Wikipedia lists 19 different people named “Michael Moore” …

Diving in… multiple celebrities

4

… but only one person monopolizes the top Web search results.

Page 5: Web People Search

Diving in… query refinements

Yes, users can (and do) refine their queries, but…

How do we know which refinement yields better results?

I. Introduction

if too general, we might include non-relevant documents

… actually there are two politicians with that name

Michael Moore politician

if too specific, we might miss relevant documents

… he has had other occupations

Michael Moore Mississippi attorney-general

Page 6: Web People Search

How relevant is this problem?

11-17% of Web queries include a person name

4% of Web queries are just a person name

U.S. Census Bureau: 90,000 names shared by 100,000,000 people

Web People Search engines available since 2005 (Spock, Zoominfo, Arnetminer, etc.)

I. Introduction

Page 7: Web People Search

What we get vs. what we want

fitness guru
• www.thatsfit.com/bloggers/martha-edwards/
• www.thecardioblog.com/bloggers/martha-edwards/

photographer
• www.marthaedwards.ca
• www.thecancerblog.com/bloggers/martha-edwards/

advertising Supervisor at Flamingo Las Vegas
• www.linkedin.com/pub/martha-edwards/4/378/136

St. Louis, Mo
• www.facebook.com/meedwards?ref=mf

Stagecoach Plc, United Kingdom
• www.zoominfo.com/.../Edwards_Martha_1175619539.aspx

I. Introduction

Page 8: Web People Search

This is not an easy task

I. Introduction

Page 9: Web People Search

Goals

9

Formalize the name disambiguation problem in Web search results:

Review the name disambiguation problem in the state of the art.

Motivate empirically the need for automatic methods.

Create an evaluation framework:

Define a task.

Create a testbed corpus.

Adopt evaluation methodology and quality measures.

Analyze the impact of different document representations.

I. Introduction

Page 10: Web People Search

How we addressed the problem

Task formalization

Preliminary studies

First evaluation campaign

Data acquisition

Community building

Evaluation methodology refinement

Second evaluation campaign

Consolidated methodology

Empirical studies

I. Introduction

Page 11: Web People Search

Web People Search

11

I. Introduction.

II. Benchmarking.

I. The WePS-1 Campaign.

II. Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints.

III. Combining Evaluation Metrics via the Unanimous Improvement Ratio.

IV. The WePS-2 Campaign.

III. Empirical Studies.

I. The Scope of Query Refinement in the WePS Task.

II. The Role of Named Entities in WePS.

Contributions.

Page 12: Web People Search

WePS-1: clustering task

search engine

system

12

fitness guru

• www.thatsfit.com/bloggers/martha-edwards/
• www.thecardioblog.com/bloggers/martha-edwards/

photographer
• www.marthaedwards.ca
• www.thecancerblog.com/bloggers/martha-edwards/

advertising Supervisor at Flamingo Las Vegas
• www.linkedin.com/pub/martha-edwards/4/378/136

St. Louis, Mo
• www.facebook.com/meedwards?ref=mf

Stagecoach Plc, United Kingdom
• www.zoominfo.com/.../Edwards_Martha_1175619539.aspx

II. i. The WePS-1 Campaign

Page 13: Web People Search

Testbed generation process

Person name selection

Each name is sent as a query to a Web search engine.

Collect the top 100 search results for each name

Manually group the pages according to the individual they refer to

Name sources: Wikipedia, US Census, ACL'06.

Example names: George Foster, James Hamilton, Martha Edwards, Thomas Fraser, Thomas Kirk.

II. i. The WePS-1 Campaign

Page 14: Web People Search

Annotation: a nice page

II. i. The WePS-1 Campaign

Page 15: Web People Search

Annotation: kind of difficult...

II. i. The WePS-1 Campaign

Page 16: Web People Search

Annotation: frankly, no clue

II. i. The WePS-1 Campaign

Page 17: Web People Search

SIGIR 2005 preliminary testbed

Manual annotation consisted of: Clustering of the pages according to the individual they refer to.

Biographical attributes.

Page classification (home page, part of a home page, reference, other).

Points for improvement:

WePS-1 should concentrate efforts on clustering annotation.

Add more name sources.

Names shared with non-person entities.

Also consider ambiguity within documents (overlapping clustering).

II. i. The WePS-1 Campaign

Page 18: Web People Search

WePS-1 Training and Test collections

Training:
name source | avg. entities | avg. documents
Wikipedia | 23.14 | 99.00
ECDL06 | 15.30 | 99.20
WEB03 | 5.90 | 47.20
avg. | 10.76 | 71.02

Test:
name source | avg. entities | avg. documents
Wikipedia | 56.50 | 99.30
ACL06 | 31.00 | 98.40
Census | 50.30 | 99.10
avg. | 45.93 | 98.93

The test data turned out to have a much higher average ambiguity, even for the same name sources.

II. i. The WePS-1 Campaign

Page 19: Web People Search

Purity (P): rewards clusters without noise.

Inverse Purity (IP): rewards grouping items from the same category.

Fα=0.5: harmonic mean of P and IP.

Fα=0.2: bias towards IP.
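A minimal sketch of how these measures can be computed, assuming a hypothetical two-person split of six pages that reproduces the all-in-one scores shown on this slide:

from collections import Counter

def purity(clusters, labels):
    # fraction of items assigned to the majority category of their cluster
    n = sum(len(c) for c in clusters)
    return sum(Counter(labels[d] for d in c).most_common(1)[0][1] for c in clusters) / n

def inverse_purity(clusters, labels):
    # purity computed with the roles of clusters and gold categories swapped
    categories = {}
    for doc, label in labels.items():
        categories.setdefault(label, set()).add(doc)
    cluster_of = {doc: i for i, c in enumerate(clusters) for doc in c}
    return purity(list(categories.values()), cluster_of)

def f_alpha(p, ip, alpha=0.5):
    # van Rijsbergen's F: alpha = 0.5 is the harmonic mean, alpha = 0.2 biases towards IP
    return 1.0 / (alpha / p + (1.0 - alpha) / ip)

labels = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}   # hypothetical gold categories
all_in_one = [{1, 2, 3, 4, 5, 6}]
p, ip = purity(all_in_one, labels), inverse_purity(all_in_one, labels)
print(p, ip, round(f_alpha(p, ip), 2))   # 0.5  1.0  0.67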

One in one baseline (each page in its own cluster): P: 1.00, IP: 0.48, F0.5: 0.65

All in one baseline (every page in a single cluster): P: 0.50, IP: 1.00, F0.5: 0.67

Evaluation Metrics Baselines

II. i. The WePS-1 Campaign

Cheat system (Paul Kalmar): P: 0.75, IP: 1.00, F0.5: 0.86

Purity measures can be cheated in WePS!

ranked by purity | ranked by inverse purity | ranked by Fα=0.5
S4      0.81     | Cheat S  1.00            | S1      0.79
S3      0.75     | S14      0.95            | Cheat S 0.78
S2      0.73     | S13      0.93            | S2      0.77
S1      0.72     | S15      0.91            | S3      0.77
Cheat S 0.64     | S5       0.90            | S4      0.69
S6      0.60     | S10      0.89            | S5      0.67
S9      0.58     | S7       0.88            | S6      0.66
S8      0.55     | S1       0.88            | S7      0.64
S5      0.53     | S12      0.83            | S8      0.62
S7      0.50     | S11      0.82            | S9      0.61

Page 20: Web People Search

WePS-1 Systems ranking

20

team F α=0.5 purity inv. purity

CU_COMSEM 0.79 0.72 0.88

CHEAT_SYSTEM 0.78 0.64 1.00

IRST-BP 0.77 0.75 0.80

PSNUS 0.77 0.73 0.82

UVA 0.69 0.81 0.60

FICO 0.67 0.53 0.90

UNN 0.66 0.60 0.73

ONE_IN_ONE 0.64 1.00 0.47

AUG 0.64 0.50 0.88

SWAT-IV 0.62 0.55 0.71

UA-ZSA 0.61 0.58 0.64

TITPI 0.60 0.45 0.89

JHU1-13 0.58 0.45 0.82

DFKI2 0.53 0.39 0.83

WIT 0.52 0.36 0.93

UC3M_13 0.51 0.35 0.95

UBC-AS 0.45 0.30 0.91

ALL_IN_ONE 0.45 0.29 1.00

The most common system configuration (sketched below):
• Full document BoW
• HAC (single link)
• Cosine similarity
• Trained similarity threshold
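A rough sketch of this common configuration (not any particular team's system), assuming a recent scikit-learn; the TFIDF weighting and the similarity threshold are illustrative choices to be tuned on the training data:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import AgglomerativeClustering

def cluster_search_results(documents, similarity_threshold=0.2):
    # bag-of-words vectors, cosine similarity, single-link HAC cut at a trained threshold
    vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)
    distance = np.clip(1.0 - cosine_similarity(vectors), 0.0, None)
    hac = AgglomerativeClustering(n_clusters=None,
                                  distance_threshold=1.0 - similarity_threshold,
                                  metric="precomputed",
                                  linkage="single")
    return hac.fit_predict(distance)   # one cluster id per search result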

Frequent “singleton” people in WePS-1

Alpha parametrization has a strong effect on the systems ranking.

II. i. The WePS-1 Campaign

Page 21: Web People Search

WePS-1 Summary

21

Variability across test cases is large and unpredictable.

Testbed creation is more difficult and expensive than expected.

Purity measures can be cheated! Are purity and inverse purity the best options available among clustering metrics?

The combination of metrics has a strong effect on how we measure the contribution of systems. How does the combination of metrics affect the systems ranking?

II. i. The WePS-1 Campaign

Page 22: Web People Search

Web People Search

22

I. Introduction.

II. Benchmarking.

I. The WePS-1 Campaign.

II. Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints.

III. Combining Evaluation Metrics via the Unanimous Improvement Ratio.

IV. The WePS-2 Campaign.

III. Empirical Studies.

I. The Scope of Query Refinement in the WePS Task.

II. The Role of Named Entities in WePS.

Contributions.

Page 23: Web People Search

Comparing clustering evaluation metrics

Which of the current clustering metrics is most appropriate for the WePS task?

We compare different families of clustering metrics.

We define constraints in order to characterize metric families.

We adapt metrics to the overlapping clustering problem.

II. ii. Clustering Evaluation Metrics

Page 24: Web People Search

Formal constraints: Cluster homogeneity

24

o Let S be a set of items belonging to categories L1 … Ln.

o Let D1 be a cluster distribution with one cluster C containing items from two categories Li, Lj.

o Let D2 be a distribution identical to D1, except for the fact that the cluster C is split into two clusters containing the items with category Li and the items with category Lj, respectively.

o Then Q(D1) < Q(D2).

Human validation: 92 %

II. ii. Clustering Evaluation Metrics

Page 25: Web People Search

Formal constraints: Cluster completeness

25

o Let D1 be a distribution such that two clusters C1, C2 only contain items belonging to the same category L.

o Let D2 be an identical distribution, except for the fact that C1 and C2 are merged into a single cluster.

o Then Q(D1) < Q(D2).

II. ii. Clustering Evaluation Metrics

Human validation: 90 %

Page 26: Web People Search

Formal constraints: Rag Bag

26

o Let Cclean be a cluster with n items belonging to the same category.

o Let Cnoisy be a cluster merging n items from unary categories.

o Let D1 be a distribution with a new item from a new category merged with the highly clean cluster Cclean, and D2 another distribution with this new item merged with the highly noisy cluster Cnoisy .

o Then Q(D1) < Q(D2).

II. ii. Clustering Evaluation Metrics

Human validation: 95 %

Page 27: Web People Search

Formal constraints: Cluster size vs. quantity

27

o Let us consider a distribution D containing a cluster Cl with n+1 items belonging to the same category L, and n additional clusters C1 … Cn, each of them containing two items from the same category L1 … Ln.

o If D1 is a new distribution similar to D, where each Ci is split into two unary clusters, and D2 is a distribution similar to D, where Cl is split into one cluster of size n and one cluster of size 1.

o Then Q(D1) < Q(D2).

II. ii. Clustering Evaluation Metrics

Human validation: 100%

Page 28: Web People Search

Comparison of evaluation metrics

28

BCubed

Pairs counting

Entropy

Edit distance

Set matching

II. ii. Clustering Evaluation Metrics

Page 29: Web People Search

BCubed Precision and Recall
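The figure originally on this slide is not reproduced here; for reference, the standard (non-overlapping) BCubed formulation averages a per-item precision and recall over all items:

Precision_BCubed = Avg_e [ Avg_{e' in the cluster of e} [ L(e') = L(e) ] ]
Recall_BCubed    = Avg_e [ Avg_{e' in the category of e} [ C(e') = C(e) ] ]

where C(e) is the cluster and L(e) the gold category of item e, and [condition] equals 1 when the condition holds and 0 otherwise.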

II. ii. Clustering Evaluation Metrics

Page 30: Web People Search

Evaluation on overlapping clustering

30

If n different people sharing the same name are mentioned in a document, that document should appear in n clusters.

The metrics reviewed so far do not consider overlapping clustering.

II. ii. Clustering Evaluation Metrics

Page 31: Web People Search

BCubed extended for overlapping clustering

31

To extend BCubed we must take into account the multiplicity of item occurrences in clusters and classes:

Precision decreases when two elements share too many clusters

Multiplicity precision and recall are integrated in the overall BCubed metrics:

Recall decreases when two elements share too few clusters

II. ii. Clustering Evaluation Metrics

Page 32: Web People Search

BCubed extended for overlapping clustering

32

Perfect clustering:
Recall(e1, e2) = min(2,2)/2 = 1
Precision(e1, e2) = min(2,2)/2 = 1

Losing Recall:
Recall(e1, e2) = min(1,2)/2 = 0.5
Precision(e1, e2) = min(1,2)/1 = 1

Losing Precision:
Recall(e1, e2) = min(3,2)/2 = 1
Precision(e1, e2) = min(3,2)/3 = 0.66

II. ii. Clustering Evaluation Metrics

Page 33: Web People Search

WePS-1 results revisited

33

Purity and Inverse Purity BCubed Precision and Recall

II. ii. Clustering Evaluation Metrics

Page 34: Web People Search

Clustering Metrics Summary

We have proposed a set of formal constraints for clustering evaluation metrics.

The combination of BCubed precision and recall is the only one that satisfies all constraints.

We have extended BCubed to handle overlapping clustering.

We have tested the extended BCubed on the WePS-1 results and found that it effectively discriminates the baselines and the cheat system.

II. ii. Clustering Evaluation Metrics

Page 35: Web People Search

Web People Search

35

I. Introduction.

II. Benchmarking.

I. The WePS-1 Campaign.

II. Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints.

III. Combining Evaluation Metrics via the Unanimous Improvement Ratio.

IV. The WePS-2 Campaign.

III. Empirical Studies.

I. The Scope of Query Refinement in the WePS Task.

II. The Role of Named Entities in WePS.

Contributions.

Page 36: Web People Search

How does the combination of metrics affect the systems ranking ?

36

Ranking is highly sensitive to α parametrization in F.

II. iii. Unanimous Improvement Ratio

[Figure: F value as a function of the α parameterization (0 to 1) for systems S1–S16 and the seq-1/seq-100 runs; one end of the α axis biases towards precision, the other towards recall, and the systems ranking shifts substantially along the axis.]

We can obtain statistical significance for contradictory results if α is changed:

           seq-1  S14   p (Wilcoxon)
Fα=0.5     0.61   0.49  0.022
Fα=0.2     0.52   0.66  0.015

Page 37: Web People Search

Unanimous Improvement Ratio

37

Counts the number of topics for which system a improves system b according to all evaluation metrics.
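A minimal sketch of this idea, assuming the formulation in which a topic counts for a system when it is at least as good as the other on every metric; the per-topic score dictionaries are illustrative inputs:

def uir(scores_a, scores_b):
    # scores_a / scores_b: one dict per topic, e.g. {"purity": 0.8, "inverse_purity": 0.7}
    a_wins = sum(all(ta[m] >= tb[m] for m in ta) for ta, tb in zip(scores_a, scores_b))
    b_wins = sum(all(tb[m] >= ta[m] for m in ta) for ta, tb in zip(scores_a, scores_b))
    return (a_wins - b_wins) / len(scores_a)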

II. iii. Unanimous Improvement Ratio

Page 38: Web People Search

Unanimous Improvement Ratio

38

UIR rewards robustness across α values.

[Bar charts for three system pairs: ΔFα=0.5 vs. UIR values of (0.07, 0.32), (0.08, 0.42) and (0.07, 0.39).]

II. iii. Unanimous Improvement Ratio

Page 39: Web People Search

Metrics Combination Summary

The comparison of systems in clustering tasks is highly sensitive to the metrics combination criterion.

UIR allows us to combine metrics without assigning relative weights to each metric.

UIR rewards robust improvements across different alpha values of F-measure.

UIR is a complementary method to assess the best approach during the system training process.

II. iii. Unanimous Improvement Ratio

Page 40: Web People Search

Web People Search

40

I. Introduction.

II. Benchmarking.

I. The WePS-1 Campaign.

II. Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints.

III. Combining Evaluation Metrics via the Unanimous Improvement Ratio.

IV. The WePS-2 Campaign.

III. Empirical Studies.

I. The Scope of Query Refinement in the WePS Task.

II. The Role of Named Entities in WePS.

Contributions.

Page 41: Web People Search

WePS clustering task and ...

search engine

system

41

fitness guru

• www.thatsfit.com/bloggers/martha-edwards/
• www.thecardioblog.com/bloggers/martha-edwards/

photographer
• www.marthaedwards.ca
• www.thecancerblog.com/bloggers/martha-edwards/

advertising Supervisor at Flamingo Las Vegas
• www.linkedin.com/pub/martha-edwards/4/378/136

St. Louis, Mo
• www.facebook.com/meedwards?ref=mf

Stagecoach Plc, United Kingdom
• www.zoominfo.com/.../Edwards_Martha_1175619539.aspx

II. iv. The WePS-2 Campaign

Page 42: Web People Search

Input Output

• Name: John Tait
• Occupation: Specialist Information Systems Consulting Services
• Homepage: http://johntait.net
• Affiliation: Information Retrieval Facility
• Location: Vienna
• Work: Chief Scientific Officer

… we also included an Attribute Extraction task.

Satoshi Sekine and Javier Artiles. WePS2 Attribute Extraction Task. In 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference, 2009.

II. iv. The WePS-2 Campaign

Page 43: Web People Search

WePS-2 data

43

Training set: WePS 1 dataset (same methodology & size).

Followed WePS 1 guidelines.

10 x 3 new ambiguous person names (Wikipedia, US census and ACL'08 PC members).

150 web pages from the top search results.

HTML pages as well as search results metadata (snippet, rank...).

Filtered out non-HTML documents and pages without the name on them.

Also developed a GUI for the annotation task

II. iv. The WePS-2 Campaign

Page 44: Web People Search

Annotation: public profiles from social networks

II. iv. The WePS-2 Campaign

Page 45: Web People Search

Annotation: genealogies

II. iv. The WePS-2 Campaign

Page 46: Web People Search

II. iv. The WePS-2 Campaign

WePS-1 vs. WePS-2 datasets

Average ambiguity is much lower on the WePS 2 data.

There is still a wide variety of ambiguity cases.

As in WePS-1, this variability added an extra challenge to the task.

WePS-1 data — Training: Wikipedia 23.14, ECDL06 15.30, Web03 5.90, avg. 10.76 entities per name.

WePS-1 data — Test: Wikipedia 56.50, ACL06 31.00, Census 50.30, avg. 45.93 entities per name.

WePS-2 data — Test: Wikipedia 10.70, ACL06 14.20, Census 30.30, avg. 18.46 entities per name.

Page 47: Web People Search

WePS-2 Clustering Results using BCubed

Baselines:
• One-in-one
• All-in-one
• Cheat system
• Hierarchical Agglomerative Clustering (HAC) with tokens
• HAC with bigrams

Upper bounds:
• Oracle HAC with tokens
• Oracle HAC with bigrams

II. iv. The WePS-2 Campaign

Macroaveraged scores:

rank | run | F α=0.5 | F α=0.2 | BCubed Pre. | BCubed Rec.

BEST-HAC-TOKENS 0.85 0.84 0.89 0.83

BEST-HAC-BIGRAMS 0.85 0.83 0.91 0.81

1 PolyUHK 0.82 0.80 0.87 0.79

2 UVA_1 0.81 0.80 0.85 0.80

3 ITC-UT_1 0.81 0.76 0.93 0.73

4 XMEDIA_3 0.72 0.68 0.82 0.66

5 UCI_2 0.71 0.77 0.66 0.84

6 LANZHOU_1 0.70 0.67 0.80 0.66

7 FICO_3 0.70 0.64 0.85 0.62

8 UMD_4 0.70 0.63 0.94 0.60

HAC-BIGRAMS 0.67 0.59 0.95 0.55

9 UGUELPH_1 0.63 0.75 0.54 0.93

10 CASIANED_4 0.63 0.68 0.65 0.75

HAC-TOKENS 0.59 0.52 0.95 0.48

11 AUG_4 0.57 0.56 0.73 0.58

12 UPM-SINT_4 0.56 0.59 0.60 0.66

ALL_IN_ONE 0.53 0.66 0.43 1.00

CHEAT_SYS 0.52 0.65 0.43 1.00

13 UNN_2 0.52 0.48 0.76 0.47

14 ECNU_1 0.41 0.44 0.50 0.55

15 UNED_3 0.40 0.38 0.66 0.39

16 PRIYAVEN 0.39 0.37 0.61 0.38

ONE_IN_ONE 0.34 0.27 1.00 0.24

17 BUAP_1 0.33 0.27 0.89 0.25

Page 48: Web People Search

II. iv. The WePS-2 Campaign

System | F0.5 | Improved systems (UIR > 0.25) | Reference system | UIR for the reference system

(S1) PolyUHK 0.82 S2 S4 S6 S7 S8 S11 … S17 B1 - -

(S2) ITC-UT_1 0.81 S4 S6 S7 S8 S11 … S17 B1 S1 0.26

(S3) UVA_1 0.81 S2 S4 S7 S8 S11 … S17 B1 - -

(S4) XMEDIA_3 0.72 S11 S13 … S17 S1 0.58

(S5) UCI_2 0.71 S12 … S16 - -

(S6) UMD_4 0.70 S4 S7 S11 S13 … S17 B1 S1 0.35

(S7) FICO_3 0.70 S11 S13 … S17 S2 0.65

(S8) LANZHOU_1 0.70 S11 … S17 S1 0.74

(S9) UGUELPH_1 0.63 S4 S12 S14 S16 - -

(S10) CASIANED_4 0.63 S12 … S16 - -

(S11) AUG_4 0.57 S14 … S17 S3 0.68

(S12) UPM-SINT_4 0.56 S14 S16 S1 0.71

(B100) ALL_IN_ONE 0.53 Bcheat - -

(S13) UNN_2 0.52 S15 S16 S1 0.90

(Bcheat) CHEAT_SYS 0.52 - B100 0.65

(S14) ECNU_1 0.41 - S1 0.90

(S15) UNED_3 0.40 S16 S1 0.97

(S16) PRIYAVEN 0.39 - S1 1.00

(B1) ONE_IN_ONE 0.34 S17 S1 0.29

(S17) BUAP_1 0.33 - S6 0.84

Results of UIR on the WePS-2 dataset

Page 49: Web People Search

Run | Features | Feature weighting | Similarity | Clustering

PolyUHK | Local sentences, full text BoW, URL tokens, title tokens in root page, unigrams and bigrams, snippet-based features | TFIDF | Cosine similarity | HAC

UVA_1 | Stemmed words (Porter stemmer, standard stopword list) | Modified TFIDF | Cosine similarity | HAC

ITC-UT_1 | NEs, compound keywords, link features | - | Overlap coefficient | Two-stage HAC

UMD_4 | Tokens, NEs, variations of the ambiguous name, hyperlinks | - | Jaro-Winkler, Jaccard | HAC

XMEDIA_3 | Local unigrams and bigrams | Self information | Cosine similarity and learned similarity metrics | QT variant

UCI_2 | NEs, web overlap statistics for person and organization | TFIDF | Cosine similarity, Skyline classifier for web-based features | Two-stage clustering

LANZHOU_1 | NEs, email, phone, date, occupation | TFIDF | Cosine similarity | HAC

FICO_3 | NEs, URL tokens, page title tokens, NE lists, name match, gender | - | Heuristic based on matching and non-matching features | Greedy agglomeration within a block

UGUELPH_1 | Full text BoW | Modified TFIDF | - | Chameleon clustering

CASIANED_4 | NEs, tokens, URL tokens, snippet | TFIDF for tokens, special weighting for NEs | Cosine similarity | Classify pages according to the person's profession

AUG_4 | Place/date of birth/death, NEs, IP address, geographic location coordinates, weighted keywords, URL, email address, telephone, fax | Gain ratio | Cosine similarity | Fuzzy ants clustering, Agnes (hierarchical clustering)

UPM-SINT_4 | Full text BoW | - | Word overlap | -

ECNU_1 | Stemmed words selected with a χ² measure | Term frequency | Cosine similarity | K-means

PRIYAVEN | Full text BoW | TFIDF | Weighted Jaccard | Fuzzy ants clustering

UNED_3 | Relevant terms extracted with language model techniques | Kullback-Leibler divergence | Language models and cosine similarity | Heuristic

BUAP_1 | NEs | Term frequency | - | -

49

Page 50: Web People Search

WePS-2 summary

Consolidation of the WePS community: 17 research teams took part in the WePS-2 clustering task.

WePS-2 now provides benchmarking datasets and standardized evaluation metrics for the clustering and attribute extraction subtasks.

Now we can empirically answer questions such as:

How good are manual query refinements in WePS ?

What is the role of Named Entities in this task ?

II. iv. The WePS-2 Campaign

Page 51: Web People Search

Web People Search

51

I. Introduction.

II. Benchmarking.

I. The WePS-1 Campaign.

II. Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints.

III. Combining Evaluation Metrics via the Unanimous Improvement Ratio.

IV. The WePS-2 Campaign.

III. Empirical Studies.

I. The Scope of Query Refinement in the WePS Task.

II. The Role of Named Entities in WePS.

Contributions.

Page 52: Web People Search

Query Refinements in WePS

52

How good are manual query refinements ?

Are they a feasible people search strategy ?

III. i. The Scope of Query Refinement in the WePS Task

Page 53: Web People Search

Query Refinements in WePS

53

Tokens, bigrams, trigrams…
• football, publications, research
• curriculum vitae, full professor

Named Entities (person, location, organization)
• John Smith, Mary Jones
• Kansas City, Sunderland
• University of Sunderland

Manually extracted attributes (occupation, affiliation, email…)
• Occupation: Full professor
• born in 1940
• born in London

Trying to find John Tait, the researcher, in the John Tait document collection from WePS:

John Tait + refinement
Precision: 6/8
Recall: 6/6
Coverage: 1

[Diagram: retrieved documents split into relevant and non-relevant.]

III. i. The Scope of Query Refinement in the WePS Task

Page 54: Web People Search

Query Refinements in WePS

54

We will simulate query refinements for the people in the WePS testbed.

The best query refinements will be obtained from the documents and applied to refine the corresponding name document set.
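A minimal sketch of how a simulated refinement can be scored against the gold clustering for one person; matching a refinement by plain string containment is an assumption of this sketch, not necessarily the exact procedure used in the thesis:

def refinement_precision_recall(refinement, pages, relevant):
    # pages: {url: text} for one ambiguous name; relevant: gold urls of the target person
    retrieved = {url for url, text in pages.items() if refinement.lower() in text.lower()}
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Coverage is aggregated over test cases: the fraction of people for whom a refinement
# of the given type (token, attribute, named entity, ...) is available at all.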

III. i. The Scope of Query Refinement in the WePS Task

Page 55: Web People Search

III. i. The Scope of Query Refinement in the WePS Task

Results for popular people (clusters of size >= 3)

for each test case we select the best…

F α=0.5 precision recall coverage

token 0.87 0.90 0.86 1.00

bigram 0.79 0.95 0.70 1.00

trigram 0.75 0.96 0.65 1.00
…

Best n-gram 0.89 0.95 0.85 1.00

affiliation 0.51 0.96 0.39 0.81

occupation 0.52 0.93 0.40 0.80

email 0.35 0.96 0.23 0.33
…

Best manual attribute 0.60 0.97 0.47 0.92

location 0.62 0.87 0.53 1.00

organization 0.67 0.96 0.56 1.00

person 0.59 0.95 0.47 1.00

Best named entity 0.74 0.95 0.63 1.00

Best 0.89 0.96 0.85 1.00

55

Very good results when using all refinement types…

… but tokens and word n-grams alone already achieve the highest results.

There is usually at least one QR that leads to the desired set of results…

… but not necessarily an intuitive choice

Lower coverage in the manually extracted QRs

Manually tagged attributes: very precise, but they are not always present

Page 56: Web People Search

Scope of Query Refinements: Summary

There is not a single type of refinement that leads to optimal results, but a combination of diverse types.

Search results clustering might indeed be of practical help to users searching for people on the Web.

III. i. The Scope of Query Refinement in the WePS Task

Page 57: Web People Search

Web People Search

57

I. Introduction.

II. Benchmarking.

I. The WePS-1 Campaign.

II. Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints.

III. Combining Evaluation Metrics via the Unanimous Improvement Ratio.

IV. The WePS-2 Campaign.

III. Empirical Studies.

I. The Scope of Query Refinement in the WePS Task.

II. The Role of Named Entities in WePS.

Contributions.

Page 58: Web People Search

How effective are Named Entities compared to other features for document representation in WePS ?

Document representation: NEs vs other approaches

58

[Diagram: document collection → document representation (term weight vectors) → similarity → clustering.]

III. ii. The Role of Named Entities in WePS

Page 59: Web People Search

Reformulating the WePS task

59

Single features:

Classification task over coreferent document pairs.

Not dependent on the clustering algorithm.

WePS-1 and WePS-2 corpora: 293,000 document pairs.

Similarity between pairs computed using each feature.

Results evaluated with Precision and Recall.

III. ii. The Role of Named Entities in WePS

Page 60: Web People Search

Token based features

60

Tokens provide the best overall performance

III. ii. The Role of Named Entities in WePS

Page 61: Web People Search

Word n-gram based features

61

n-grams: more precise than single tokens at the cost of recall.

III. ii. The Role of Named Entities in WePS

Page 62: Web People Search

Named entities: Stanford NE tagger

62

Taken individually, NEs do not improve over tokens.

III. ii. The Role of Named Entities in WePS

Page 63: Web People Search

Reformulating the WePS task

63

Combination of features:

WePS-1 and WePS-2 corpora: 293,000 document pairs.

Similarity computed using feature combinations.

Results evaluated with machine learning and an upper boundary.

III. ii. The Role of Named Entities in WePS

Page 64: Web People Search

Combining similarity criteria

64

PWA measures the classification accuracy of one similarity criterion x:

PWA(x) = Prob( Sim_x(D_A, D_A') > Sim_x(D_B, D_C) )

where (D_A, D_A') is a coreferent document pair and (D_B, D_C) is a non-coreferent pair.

MaxPWA estimates the upper boundary accuracy of a set of similarity criteria X = <x_1, …, x_n>:

MaxPWA(X) = Prob( ∃ x_i ∈ X . Sim_{x_i}(D_A, D_A') > Sim_{x_i}(D_B, D_C) )
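A minimal sketch of how PWA and MaxPWA could be estimated by sampling pairs; the sampling scheme and the function signatures are illustrative assumptions:

import random

def estimate_pwa(sim_fns, coref_pairs, non_coref_pairs, samples=10000, seed=0):
    # sim_fns: {criterion name: function(doc_a, doc_b) -> similarity}
    rng = random.Random(seed)
    pwa = {name: 0 for name in sim_fns}
    max_pwa = 0
    for _ in range(samples):
        a = rng.choice(coref_pairs)        # a pair of pages about the same person
        b = rng.choice(non_coref_pairs)    # a pair of pages about different people
        wins = [name for name, sim in sim_fns.items() if sim(*a) > sim(*b)]
        for name in wins:
            pwa[name] += 1
        max_pwa += bool(wins)              # at least one criterion ranked the coreferent pair higher
    return {name: c / samples for name, c in pwa.items()}, max_pwa / samples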

III. ii. The Role of Named Entities in WePS

Page 65: Web People Search

Combining similarity criteria

65

Decision Tree and MaxPWA results are consistent.

Adding new features to tokens improves the classification.

NEs do not offer a competitive advantage when compared to non-linguistic features.

III. ii. The Role of Named Entities in WePS

[Chart comparing: Tokens; Tokens + n-grams; All features (including NEs).]

Page 66: Web People Search

Is our setting competitive with state-of-the-art systems?

Document representation: NEs vs other approaches

66

[Diagram: document collection → document representation (term weight vectors) → similarity → clustering.]

III. ii. The Role of Named Entities in WePS

Page 67: Web People Search

Results on the clustering task

67

The output of the Decision Tree classifier was used as the similarity metric.

These similarities were fed into a Hierarchical Agglomerative Clustering algorithm (sketched below).

A distance threshold was trained using WePS-1 data.
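A rough sketch of this setup, assuming scikit-learn; the feature matrices, the pair indexing and the threshold are illustrative placeholders rather than the exact thesis configuration:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import AgglomerativeClustering

def cluster_with_learned_similarity(train_X, train_y, test_X, test_pairs, n_docs, threshold=0.8):
    # train_X/train_y: pairwise feature vectors and coreference labels (e.g. WePS-1 pairs)
    # test_X/test_pairs: feature vectors and (i, j) document indices for one test name
    tree = DecisionTreeClassifier(max_depth=5).fit(train_X, train_y)
    proba = tree.predict_proba(test_X)[:, 1]          # P(coreferent) for each pair
    distance = np.ones((n_docs, n_docs))
    np.fill_diagonal(distance, 0.0)
    for (i, j), p in zip(test_pairs, proba):
        distance[i, j] = distance[j, i] = 1.0 - p
    hac = AgglomerativeClustering(n_clusters=None, distance_threshold=threshold,
                                  metric="precomputed", linkage="single")
    return hac.fit_predict(distance)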

Comparable to the best participant in WePS-2.

Adding NEs does not improve results.

III. ii. The Role of Named Entities in WePS

Page 68: Web People Search

Document representation: NEs vs other approaches

68

Individual features

Feature combinations

Validation on the clustering task

III. ii. The Role of Named Entities in WePS

Page 69: Web People Search

Named entities do not seem to provide a competitive advantage in the clustering process when compared to a combination of simpler features (tokens, n-grams, etc.).

This is not a prescription against the use of NEs:

They can be appropriate for presentation purposes.

Other approaches might be able to improve results using NE information.

Role of Named Entities Summary

III. ii. The Role of Named Entities in WePS

Page 70: Web People Search

Web People Search

70

I. Introduction.

II. Benchmarking.

I. The WePS-1 Campaign.

II. Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints.

III. Combining Evaluation Metrics via the Unanimous Improvement Ratio.

IV. The WePS-2 Campaign.

III. Empirical Studies.

I. The Scope of Query Refinement in the WePS Task.

II. The Role of Named Entities in WePS.

Contributions.

Page 71: Web People Search

Insights:

Study and characterization of the available evaluation metrics.

Extension of BCubed for overlapping clustering tasks.

Development of a metrics combination method (Unanimous Improvement Ratio) that does not depend on metric weighting.

Query refinements are effective but very diverse and unfeasible as a WePS strategy.

Named Entities do not seem to provide a competitive advantage over simpler features.

Products:

Development of a reference testbed for the task.

Currently more than 80 citations to the WePS-1 task description paper.

Several papers use WePS data as the de facto standard for the task.

A document clustering evaluation package.

An annotation GUI for document grouping tasks.

Contributions

IV. Contributions

Page 72: Web People Search

Further directions

Exploration of new approaches to the representation of documents (Wikipedia, Google n-gram corpus).

Application of the evaluation methods developed for WePS to other domains and tasks.

Future WePS evaluation campaigns.

Search for organizations.

Multilingual search (documents in different languages referring to the same person).

A new task integrating the clustering and attribute extraction problems.

IV. Contributions

Page 73: Web People Search

73

Thank you !

Page 74: Web People Search

Previous work

Related NLP tasks:

Cross Document Coreference.

Word Sense Disambiguation.

Word Sense Induction.

Test collections:

Until 2006, mostly newswire collections; Web collections are predominant now.

Manual annotation, but also pseudo-ambiguity generation.

In most cases, created ad hoc for a particular piece of research.

Disambiguation methods:

Hierarchical Agglomerative Clustering (HAC) is the most frequently employed method.

I. Introduction

Page 75: Web People Search

Unanimous Improvement Ratio

75

UIR Reflects the range of improvement

UIR = 0.03 UIR = 0.45 UIR = 0.77

Page 76: Web People Search

Web People Search vs. other NLP tasks

Cross-document Coreference: tries to link mentions of the same entities in a collection of texts.

Web People Search: groups documents that contain a mention of the same individual.

Example (Doc. 1): "[…] Captain John Smith (c. January 1580 – June 21, 1631) […]"
Example (Doc. 2): "[…] John Smith was an English adventurer […]"

76

Page 77: Web People Search

Web People Search vs. other NLP tasks

Word Sense Disambiguation: can rely on dictionaries to define the number of "senses" of an ambiguous term; common words disambiguation.

Web People Search: the number of senses is not known a priori; person name disambiguation; Web pages, open domain.

Word Sense Induction: common words disambiguation.

Citation disambiguation: person name disambiguation, but handles very structured information in a closed domain (scientific literature).

77

Page 78: Web People Search

WePS-1 Training and test collections

78

Page 79: Web People Search

Attributes

Occupation, affiliation & work are the most common attributes.

Most attributes appear in less than 1/10 of the documents.

79

Page 80: Web People Search

Attribute

Extraction Results

Difficult task!

80

Page 81: Web People Search

Scores per attribute

81

Page 82: Web People Search

Different Attributes, Different Results

Four types of attributes based on their characteristics:

Attribute | Description | Performance (recall) | Comments

Phone, FAX, email, Website | There is a typical pattern | R: 74-40 (ECNU, UvA) | Disambiguation is needed.

Degree, Nationality | Unfamiliar NE, but candidates are limited | R: 43-42 (CASIANED) | We need a good NE tagger for the category. Maybe possible.

Date of birth, Birth place, Other name, Affiliation, School, Mentor, Relative | Typical NE, disambiguation is needed | R: 55-17 (MIVTU, UvA, PolyUHK) | NE tagger is ready. We need good disambiguation.

Award, Major, Occupation | Unfamiliar and difficult NE type | R: 17-38 (UvA) | We need a good NE tagger for the category. It looks very difficult.

82

Page 83: Web People Search

Typical System Strategy

Most systems use a two-phase strategy:

1. Find the candidates
• Use an NE tagger, gazetteer or regular expressions to find candidates of the same type as the target attribute.

2. Filter (verify) the candidates
• Select only those which are attribute values of the target person. This can be done with local patterns, supervised classification, distance and cue phrases.
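A toy sketch of this two-phase strategy for one pattern-based attribute (email), using a naive proximity filter; the regular expression and the window size are illustrative, not taken from any participant system:

import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def extract_emails(text, person_name, window=300):
    # phase 1: regex candidates; phase 2: keep those near a mention of the target name
    candidates = [(m.group(), m.start()) for m in EMAIL.finditer(text)]
    mentions = [m.start() for m in re.finditer(re.escape(person_name), text, re.IGNORECASE)]
    return [email for email, pos in candidates
            if any(abs(pos - m) <= window for m in mentions)]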

83

Page 84: Web People Search

The classification accuracy of one similarity criterion x is:

PWA(x) = Prob( Sim_x(D_A, D_A') > Sim_x(D_B, D_C) )

We want to learn the relative weight of feature classes (e.g. person names vs. tokens):

Evaluation by a machine learning algorithm (e.g. Decision Tree).

Upper bound of any algorithm: when combining similarity criteria, at least one of them should identify the coreferent document pair:

MaxPWA(<x_1, …, x_n>) = Prob( ∃ x_i ∈ X . Sim_{x_i}(D_A, D_A') > Sim_{x_i}(D_B, D_C) )

Combining similarity criteria: PWA and MaxPWA (upper boundary)

84

Page 85: Web People Search

WePS-1 summary

We have built a manual testbed corpus for the development and evaluation of WePS systems: 47 person names, almost 4,700 documents, double annotation of the test data.

We have done a systematic evaluation and comparison of WePS systems: 29 teams expressed their interest in the task; 16 teams submitted results within the deadline.

Variability across test cases is large and unpredictable.

Testbed creation is more difficult and expensive than expected.

Purity measures can be cheated! Are purity and inverse purity the best options available among clustering metrics?

The combination of metrics has a strong effect on how we measure the contribution of systems (baselines are an extreme case). How does the combination of metrics affect the systems ranking?

87

Page 86: Web People Search

Contributions

A study of the actual need for name disambiguation systems.

In most cases there is an optimal query refinement for an individual…

… but this refinement is unlikely to be known in advance.

A date, a related person name, a place, the title of a book?

Results support the interest raised in the scientific community and the Web search business.

88

Page 87: Web People Search

Contributions

Development of reference test collections.

We have carried two dedicated evaluation campaigns: WePS-1 and WePS-2.

The problem has been standardised as a search results mining task (clustering and IE).

Creation of standard benchmarks for the WePS task.

Around 8,000 manually annotated web documents.

Including biographical features in WePS-2.

Manual annotation for WePS has shown to be a difficult process.

Lack of context in some documents, uncertainty even when information is available, high ambiguity, etc.

Too much information (genealogies)… or too little (public profiles from social networks).

Importance of training assessors and reaching a consensus on when to cluster two documents.

89

Page 88: Web People Search

Contributions

Development of improved clustering evaluation metrics.

We have defined four constraints for clustering quality metrics.

We have tested these constraints against different families of clustering metrics.

Only BCubed satisfies all constraints.

An additional constraint was defined to account for overlapping clustering.

BCubed has been extended for overlapping clustering and successfully applied to WePS results.

The Unanimous Improvement Ratio measure has been proposed.

Complements Precision and Recall weighting functions (F-measure)

Indicates the robustness of improvements across different α values of F.

90

Page 89: Web People Search

Contributions

The relevance of the Clustering Stopping Criterion

Ambiguity of person names is very variable.

From one to more than 70 in the top 100 search results.

This variability represents a challenge for clustering systems:

A baseline system can achieve higher scores than the best team by using the best similarity threshold for each topic.

Training a stopping criterion with a baseline approach achieves poor results...

… but a competitive result is achieved if we also train the relative weight of document similarity metrics.

In terms of evaluation, the trade-off between precision and recall metrics is determined by the stopping criterion.

This leads to a high variability of rankings depending on the evaluation metrics combination.

UIR provides complementary information in this context.

91

Page 90: Web People Search

Contributions

Study of the role of Named Entities and other features in the WePS task.

In the clustering process, NEs are not necessarily more useful thanfeatures such as word n-grams.

In our experiments, linguistic information (NEs, noun phrases, etc.) did not provide better results than computationally cheap featuressuch as tokens and word n-grams.

More sophisticated ways of using this type of information might yieldbetter results.

92

Page 91: Web People Search

Contributions

A large testbed for the WePS task.

Manually annotated collections for WePS-1 and WePS-2 campaigns.

Also available pre-processed, annotated with NLP tools and indexed with Lucene.

An annotation GUI has been developed to ease the manual grouping of web documents.

It can be reused for annotation on other disambiguation problems.

An evaluation package.

Includes standard clustering evaluation metrics.

Implements BCubed metrics and the Unanimous Improvement Ratio measure.

93

Page 92: Web People Search

Further directions

Exploration of new approaches to the representation of documents (Wikipedia, Google n-gram corpus).

Application of the evaluation methods developed for WePS to other domains and tasks.

Future WePS evaluation campaigns.

Search for organisations.

Multilingual search (documents in different languages referring to the same person).

A new task integrating the clustering and attribute extraction problems.

94