
Page 1: Web People Search

PhD Thesis Defense

Javier Artiles Picón
NLP & IR Group, UNED, Madrid

PhD supervisors:

Julio Gonzalo Arroyo

Enrique Amigó Cabrera

Web People Search

1

Page 2: Web People Search

Finding people on the Web…

Web person profiling: 80% of U.S. companies check the Web before hiring someone.

In 30% of cases, Web results impact the hiring decision (source: notoriety.com).

Popularity & reputation management.

Further natural language processing: biographical attribute extraction.

Summarization.

Simply, find out information about an individual.

I. Introduction

Page 3: Web People Search

Diving in… mixed results

1 - fitness guru

2 - photographer

3 - photographer

4 - photographer

5 - advertising Supervisor at Flamingo Las Vegas

6 - advertising Supervisor at Flamingo Las Vegas

7 - empty blog ?

8 - St. Louis, MO

9 - 55 years old LAS VEGAS, Nevada, Estados Unidos

10 - fitness guru

I. Introduction

Page 4: Web People Search

I. Introduction

Wikipedia lists 19 different people named “Michael Moore” …

Diving in… multiple celebrities

4

… but only one person monopolizes the top Web search results.

Page 5: Web People Search

Diving in… query refinements

Yes, users can (and do) refine their queries, but…

How do we know which refinement yields better results?

I. Introduction

if too general, we might include non-relevant documents

… actually there are two politicians with that name

Michael Moore politician

if too specific, we might miss relevant documents

… he has had other occupations

Michael Moore Mississippi attorney-general

Page 6: Web People Search

How relevant is this problem?

11-17% of Web queries include a person name

4% of Web queries are just a person name

U.S. Census Bureau: 90,000 names shared by 100,000,000 people

Web People Search engines available since 2005 (Spock, Zoominfo, Arnetminer, etc.)

I. Introduction

Page 7: Web People Search

What we get vs. what we want

fitness guru
• www.thatsfit.com/bloggers/martha-edwards/
• www.thecardioblog.com/bloggers/martha-edwards/

photographer
• www.marthaedwards.ca
• www.thecancerblog.com/bloggers/martha-edwards/

advertising Supervisor at Flamingo Las Vegas
• www.linkedin.com/pub/martha-edwards/4/378/136

St. Louis, Mo
• www.facebook.com/meedwards?ref=mf

Stagecoach Plc, United Kingdom
• www.zoominfo.com/.../Edwards_Martha_1175619539.aspx

I. Introduction

Page 8: Web People Search

This is not an easy task

I. Introduction

Page 9: Web People Search

Goals

9

Formalize the name disambiguation problem in Web search results:

Review the name disambiguation problem in the state of the art.

Motivate empirically the need for automatic methods.

Create an evaluation framework:

Define a task.

Create a testbed corpus.

Adopt evaluation methodology and quality measures.

Analyze the impact of different document representations.

I. Introduction

Page 10: Web People Search

How we addressed the problem

Task formalization

Preliminary studies

First evaluation campaign

Data acquisition

Community building

Evaluation methodology refinement

Second evaluation campaign

Consolidated methodology

Empirical studies

I. Introduction

Page 11: Web People Search

Web People Search

11

I. Introduction.

II. Benchmarking.

I. The WePS-1 Campaign.

II. Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints.

III. Combining Evaluation Metrics via the Unanimous Improvement Ratio.

IV. The WePS-2 Campaign.

III. Empirical Studies.

I. The Scope of Query Refinement in the WePS Task.

II. The Role of Named Entities in WePS.

Contributions.

Page 12: Web People Search

WePS-1: clustering task

search engine

system

12

fitness guru

• www.thatsfit.com/bloggers/martha-edwards/
• www.thecardioblog.com/bloggers/martha-edwards/

photographer
• www.marthaedwards.ca
• www.thecancerblog.com/bloggers/martha-edwards/

advertising Supervisor at Flamingo Las Vegas
• www.linkedin.com/pub/martha-edwards/4/378/136

St. Louis, Mo
• www.facebook.com/meedwards?ref=mf

Stagecoach Plc, United Kingdom
• www.zoominfo.com/.../Edwards_Martha_1175619539.aspx

II. i. The WePS-1 Campaign

Page 13: Web People Search

Testbed generation process

Person name selection

Each name is sent as a query to a Web search engine.

Collect the top 100 search results for each name

Manually group the pages according to the individual they refer to

Name sources: Wikipedia, US Census, ACL'06.

Example names: George Foster, James Hamilton, Martha Edwards, Thomas Fraser, Thomas Kirk.

II. i. The WePS-1 Campaign

Page 14: Web People Search

Annotation: a nice page

II. i. The WePS-1 Campaign

Page 15: Web People Search

Annotation: kind of difficult...

II. i. The WePS-1 Campaign

Page 16: Web People Search

Annotation: frankly, no clue

II. i. The WePS-1 Campaign

Page 17: Web People Search

SIGIR 2005 preliminary testbed

Manual annotation consisted of: Clustering of the pages according to the individual they refer to.

Biographical attributes.

Page classification (home page, part of a home page, reference, other).

Points for improvement:

WePS-1 should concentrate efforts on clustering annotation.

Add more name sources.

Names shared with non-person entities.

Also consider ambiguity within documents (overlapping clustering).

II. i. The WePS-1 Campaign

Page 18: Web People Search

WePS-1 Training and Test collections

Training:
name source | avg. entities | avg. documents
Wikipedia | 23.14 | 99.00
ECDL06 | 15.30 | 99.20
WEB03 | 5.90 | 47.20
avg. | 10.76 | 71.02

Test:
name source | avg. entities | avg. documents
Wikipedia | 56.50 | 99.30
ACL06 | 31.00 | 98.40
Census | 50.30 | 99.10
avg. | 45.93 | 98.93

The test data turned out to have a much higher average ambiguity, even for the same name sources.

II. i. The WePS-1 Campaign

Page 19: Web People Search

Purity (P): rewards clusters without noise.

Inverse Purity (IP): rewards grouping items from the same category.

Fα=0.5: harmonic mean of P and IP.

Fα=0.2: bias towards IP.
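A minimal sketch of how these measures can be computed, assuming a hypothetical two-person split of six pages that reproduces the all-in-one scores shown on this slide:

from collections import Counter

def purity(clusters, labels):
    # fraction of items assigned to the majority category of their cluster
    n = sum(len(c) for c in clusters)
    return sum(Counter(labels[d] for d in c).most_common(1)[0][1] for c in clusters) / n

def inverse_purity(clusters, labels):
    # purity computed with the roles of clusters and gold categories swapped
    categories = {}
    for doc, label in labels.items():
        categories.setdefault(label, set()).add(doc)
    cluster_of = {doc: i for i, c in enumerate(clusters) for doc in c}
    return purity(list(categories.values()), cluster_of)

def f_alpha(p, ip, alpha=0.5):
    # van Rijsbergen's F: alpha = 0.5 is the harmonic mean, alpha = 0.2 biases towards IP
    return 1.0 / (alpha / p + (1.0 - alpha) / ip)

labels = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}   # hypothetical gold categories
all_in_one = [{1, 2, 3, 4, 5, 6}]
p, ip = purity(all_in_one, labels), inverse_purity(all_in_one, labels)
print(p, ip, round(f_alpha(p, ip), 2))   # 0.5  1.0  0.67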

One in one baseline (each page in its own cluster): P: 1.00, IP: 0.48, F0.5: 0.65

All in one baseline (every page in a single cluster): P: 0.50, IP: 1.00, F0.5: 0.67

Evaluation Metrics Baselines

II. i. The WePS-1 Campaign

Cheat system (Paul Kalmar): P: 0.75, IP: 1.00, F0.5: 0.86

Purity measures can be cheated in WePS!

ranked by purity | ranked by inverse purity | ranked by Fα=0.5
S4      0.81     | Cheat S  1.00            | S1      0.79
S3      0.75     | S14      0.95            | Cheat S 0.78
S2      0.73     | S13      0.93            | S2      0.77
S1      0.72     | S15      0.91            | S3      0.77
Cheat S 0.64     | S5       0.90            | S4      0.69
S6      0.60     | S10      0.89            | S5      0.67
S9      0.58     | S7       0.88            | S6      0.66
S8      0.55     | S1       0.88            | S7      0.64
S5      0.53     | S12      0.83            | S8      0.62
S7      0.50     | S11      0.82            | S9      0.61

Page 20: Web People Search

WePS-1 Systems ranking

20

team F α=0.5 purity inv. purity

CU_COMSEM 0.79 0.72 0.88

CHEAT_SYSTEM 0.78 0.64 1.00

IRST-BP 0.77 0.75 0.80

PSNUS 0.77 0.73 0.82

UVA 0.69 0.81 0.60

FICO 0.67 0.53 0.90

UNN 0.66 0.60 0.73

ONE_IN_ONE 0.64 1.00 0.47

AUG 0.64 0.50 0.88

SWAT-IV 0.62 0.55 0.71

UA-ZSA 0.61 0.58 0.64

TITPI 0.60 0.45 0.89

JHU1-13 0.58 0.45 0.82

DFKI2 0.53 0.39 0.83

WIT 0.52 0.36 0.93

UC3M_13 0.51 0.35 0.95

UBC-AS 0.45 0.30 0.91

ALL_IN_ONE 0.45 0.29 1.00

The most common system configuration (sketched below):
• Full document BoW
• HAC (single link)
• Cosine similarity
• Trained similarity threshold
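A rough sketch of this common configuration (not any particular team's system), assuming a recent scikit-learn; the TFIDF weighting and the similarity threshold are illustrative choices to be tuned on the training data:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import AgglomerativeClustering

def cluster_search_results(documents, similarity_threshold=0.2):
    # bag-of-words vectors, cosine similarity, single-link HAC cut at a trained threshold
    vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)
    distance = np.clip(1.0 - cosine_similarity(vectors), 0.0, None)
    hac = AgglomerativeClustering(n_clusters=None,
                                  distance_threshold=1.0 - similarity_threshold,
                                  metric="precomputed",
                                  linkage="single")
    return hac.fit_predict(distance)   # one cluster id per search result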

Frequent “singleton” people in WePS-1

Alpha parametrization has a strong effect on the systems ranking.

II. i. The WePS-1 Campaign

Page 21: Web People Search

WePS-1 Summary

21

Variability across test cases is large and unpredictable.

Testbed creation is more difficult and expensive than expected.

Purity measures can be cheated! Are purity and inverse purity the best options available among clustering metrics?

The combination of metrics has a strong effect on how we measure the contribution of systems. How does the combination of metrics affect the systems ranking?

II. i. The WePS-1 Campaign

Page 22: Web People Search

Web People Search

22

I. Introduction.

II. Benchmarking.

I. The WePS-1 Campaign.

II. Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints.

III. Combining Evaluation Metrics via the Unanimous Improvement Ratio.

IV. The WePS-2 Campaign.

III. Empirical Studies.

I. The Scope of Query Refinement in the WePS Task.

II. The Role of Named Entities in WePS.

Contributions.

Page 23: Web People Search

Comparing clustering evaluation metrics

Which of the current clustering metrics is most appropriate for the WePS task?

We compare different families of clustering metrics.

We define constraints in order to characterize metric families.

We adapt metrics to the overlapping clustering problem.

II. ii. Clustering Evaluation Metrics

Page 24: Web People Search

Formal constraints: Cluster homogeneity

24

o Let S be a set of items belonging to categories L1 … Ln.

o Let D1 be a cluster distribution with one cluster C containing items from two categories Li, Lj.

o Let D2 be a distribution identical to D1, except for the fact that the cluster C is split into two clusters containing the items with category Li and the items with category Lj, respectively.

o Then Q(D1) < Q(D2).

Human validation: 92 %

II. ii. Clustering Evaluation Metrics

Page 25: Web People Search

Formal constraints: Cluster completeness

25

o Let D1 be a distribution such that two clusters C1, C2 only contain items belonging to the same category L.

o Let D2 be an identical distribution, except for the fact that C1 and C2 are merged into a single cluster.

o Then Q(D1) < Q(D2).

II. ii. Clustering Evaluation Metrics

Human validation: 90 %

Page 26: Web People Search

Formal constraints: Rag Bag

26

o Let Cclean be a cluster with n items belonging to the same category.

o Let Cnoisy be a cluster merging n items from unary categories.

o Let D1 be a distribution with a new item from a new category merged with the highly clean cluster Cclean, and D2 another distribution with this new item merged with the highly noisy cluster Cnoisy .

o Then Q(D1) < Q(D2).

II. ii. Clustering Evaluation Metrics

Human validation: 95 %

Page 27: Web People Search

Formal constraints: Cluster size vs. quantity

27

o Let us consider a distribution D containing a cluster Cl with n+1 items belonging to the same category L, and n additional clusters C1 … Cn, each of them containing two items from the same category L1 … Ln.

o If D1 is a new distribution similar to D, where each Ci is split into two unary clusters, and D2 is a distribution similar to D, where Cl is split into one cluster of size n and one cluster of size 1.

o Then Q(D1) < Q(D2).

II. ii. Clustering Evaluation Metrics

Human validation: 100%

Page 28: Web People Search

Comparison of evaluation metrics

28

BCubed

Pairs counting

Entropy

Edit distance

Set matching

II. ii. Clustering Evaluation Metrics

Page 29: Web People Search

BCubed Precision and Recall
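The figure originally on this slide is not reproduced here; for reference, the standard (non-overlapping) BCubed formulation averages a per-item precision and recall over all items:

Precision_BCubed = Avg_e [ Avg_{e' in the cluster of e} [ L(e') = L(e) ] ]
Recall_BCubed    = Avg_e [ Avg_{e' in the category of e} [ C(e') = C(e) ] ]

where C(e) is the cluster and L(e) the gold category of item e, and [condition] equals 1 when the condition holds and 0 otherwise.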

II. ii. Clustering Evaluation Metrics

Page 30: Web People Search

Evaluation on overlapping clustering

30

If n different people sharing the same name are mentioned in a document, that document should appear in n clusters.

The metrics reviewed so far do not consider overlapping clustering.

II. ii. Clustering Evaluation Metrics

Page 31: Web People Search

BCubed extended for overlapping clustering

31

To extend BCubed we must take into account the multiplicity of item occurrences in clusters and classes:

Precision decreases when two elements share too many clusters

Multiplicity precision and recall are integrated in the overall BCubed metrics:

Recall decreases when two elements share too few clusters

II. ii. Clustering Evaluation Metrics

Page 32: Web People Search

BCubed extended for overlapping clustering

32

Perfect clustering:
Recall(e1, e2) = min(2,2)/2 = 1
Precision(e1, e2) = min(2,2)/2 = 1

Losing Recall:
Recall(e1, e2) = min(1,2)/2 = 0.5
Precision(e1, e2) = min(1,2)/1 = 1

Losing Precision:
Recall(e1, e2) = min(3,2)/2 = 1
Precision(e1, e2) = min(3,2)/3 = 0.66

II. ii. Clustering Evaluation Metrics

Page 33: Web People Search

WePS-1 results revisited

33

Purity and Inverse Purity BCubed Precision and Recall

II. ii. Clustering Evaluation Metrics

Page 34: Web People Search

Clustering Metrics Summary

We have proposed a set of formal constraints for clustering evaluation metrics.

The combination of BCubed precision and recall is the only one that satisfies all constraints.

We have extended BCubed to handle overlapping clustering.

We have tested the extended BCubed on the WePS-1 results and found that it effectively discriminates the baselines and the cheat system.

II. ii. Clustering Evaluation Metrics

Page 35: Web People Search

Web People Search

35

I. Introduction.

II. Benchmarking.

I. The WePS-1 Campaign.

II. Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints.

III. Combining Evaluation Metrics via the Unanimous Improvement Ratio.

IV. The WePS-2 Campaign.

III. Empirical Studies.

I. The Scope of Query Refinement in the WePS Task.

II. The Role of Named Entities in WePS.

Contributions.

Page 36: Web People Search

How does the combination of metrics affect the systems ranking ?

36

Ranking is highly sensitive to α parametrization in F.

II. iii. Unanimous Improvement Ratio

[Figure: F value as a function of the α parameterization (0 to 1) for systems S1–S16 and the seq-1/seq-100 runs; one end of the α axis biases towards precision, the other towards recall, and the systems ranking shifts substantially along the axis.]

We can obtain statistical significance for contradictory results if α is changed:

           seq-1  S14   p (Wilcoxon)
Fα=0.5     0.61   0.49  0.022
Fα=0.2     0.52   0.66  0.015

Page 37: Web People Search

Unanimous Improvement Ratio

37

Counts the number of topics for which system a improves system b according to all evaluation metrics.
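A minimal sketch of this idea, assuming the formulation in which a topic counts for a system when it is at least as good as the other on every metric; the per-topic score dictionaries are illustrative inputs:

def uir(scores_a, scores_b):
    # scores_a / scores_b: one dict per topic, e.g. {"purity": 0.8, "inverse_purity": 0.7}
    a_wins = sum(all(ta[m] >= tb[m] for m in ta) for ta, tb in zip(scores_a, scores_b))
    b_wins = sum(all(tb[m] >= ta[m] for m in ta) for ta, tb in zip(scores_a, scores_b))
    return (a_wins - b_wins) / len(scores_a)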

II. iii. Unanimous Improvement Ratio

Page 38: Web People Search

Unanimous Improvement Ratio

38

UIR rewards robustness across α values.

[Bar charts for three system pairs: ΔFα=0.5 vs. UIR values of (0.07, 0.32), (0.08, 0.42) and (0.07, 0.39).]

II. iii. Unanimous Improvement Ratio

Page 39: Web People Search

Metrics Combination Summary

The comparison of systems in clustering tasks is highly sensitive to the metrics combination criterion.

UIR allows us to combine metrics without assigning relative weights to each metric.

UIR rewards robust improvements across different alpha values of F-measure.

UIR is a complementary method to assess the best approach during the system training process.

II. iii. Unanimous Improvement Ratio

Page 40: Web People Search

Web People Search

40

I. Introduction.

II. Benchmarking.

I. The WePS-1 Campaign.

II. Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints.

III. Combining Evaluation Metrics via the Unanimous Improvement Ratio.

IV. The WePS-2 Campaign.

III. Empirical Studies.

I. The Scope of Query Refinement in the WePS Task.

II. The Role of Named Entities in WePS.

Contributions.

Page 41: Web People Search

WePS clustering task and ...

search engine

system

41

fitness guru

• www.thatsfit.com/bloggers/martha-edwards/
• www.thecardioblog.com/bloggers/martha-edwards/

photographer
• www.marthaedwards.ca
• www.thecancerblog.com/bloggers/martha-edwards/

advertising Supervisor at Flamingo Las Vegas
• www.linkedin.com/pub/martha-edwards/4/378/136

St. Louis, Mo
• www.facebook.com/meedwards?ref=mf

Stagecoach Plc, United Kingdom
• www.zoominfo.com/.../Edwards_Martha_1175619539.aspx

II. iv. The WePS-2 Campaign

Page 42: Web People Search

Input Output

• Name: John Tait
• Occupation: Specialist Information Systems Consulting Services
• Homepage: http://johntait.net
• Affiliation: Information Retrieval Facility
• Location: Vienna
• Work: Chief Scientific Officer

… we also included an Attribute Extraction task.

Satoshi Sekine and Javier Artiles. WePS2 Attribute Extraction Task. In 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference, 2009.

II. iv. The WePS-2 Campaign

Page 43: Web People Search

WePS-2 data

43

Training set: WePS 1 dataset (same methodology & size).

Followed WePS 1 guidelines.

10 x 3 new ambiguous person names (Wikipedia, US census and ACL'08 PC members).

150 web pages from the top search results.

HTML pages as well as search results metadata (snippet, rank...).

Filtered out non-HTML documents and pages without the name on them.

Also developed a GUI for the annotation task

II. iv. The WePS-2 Campaign

Page 44: Web People Search

Annotation: public profiles from social networks

II. iv. The WePS-2 Campaign

Page 45: Web People Search

Annotation: genealogies

II. iv. The WePS-2 Campaign

Page 46: Web People Search

II. iv. The WePS-2 Campaign

WePS-1 vs. WePS-2 datasets

Average ambiguity is much lower on the WePS 2 data.

There is still a wide variety of ambiguity cases.

As in WePS-1, this variability added an extra challenge to the task.

WePS-1 data — Training: Wikipedia 23.14, ECDL06 15.30, Web03 5.90, avg. 10.76 entities per name.

WePS-1 data — Test: Wikipedia 56.50, ACL06 31.00, Census 50.30, avg. 45.93 entities per name.

WePS-2 data — Test: Wikipedia 10.70, ACL06 14.20, Census 30.30, avg. 18.46 entities per name.

Page 47: Web People Search

WePS-2 Clustering Results using BCubed

Baselines:
• One-in-one
• All-in-one
• Cheat system
• Hierarchical Agglomerative Clustering (HAC) with tokens
• HAC with bigrams

Upper bounds:
• Oracle HAC with tokens
• Oracle HAC with bigrams

II. iv. The WePS-2 Campaign

Macroaveraged scores:

rank | run | F α=0.5 | F α=0.2 | BCubed Pre. | BCubed Rec.

BEST-HAC-TOKENS 0.85 0.84 0.89 0.83

BEST-HAC-BIGRAMS 0.85 0.83 0.91 0.81

1 PolyUHK 0.82 0.80 0.87 0.79

2 UVA_1 0.81 0.80 0.85 0.80

3 ITC-UT_1 0.81 0.76 0.93 0.73

4 XMEDIA_3 0.72 0.68 0.82 0.66

5 UCI_2 0.71 0.77 0.66 0.84

6 LANZHOU_1 0.70 0.67 0.80 0.66

7 FICO_3 0.70 0.64 0.85 0.62

8 UMD_4 0.70 0.63 0.94 0.60

HAC-BIGRAMS 0.67 0.59 0.95 0.55

9 UGUELPH_1 0.63 0.75 0.54 0.93

10 CASIANED_4 0.63 0.68 0.65 0.75

HAC-TOKENS 0.59 0.52 0.95 0.48

11 AUG_4 0.57 0.56 0.73 0.58

12 UPM-SINT_4 0.56 0.59 0.60 0.66

ALL_IN_ONE 0.53 0.66 0.43 1.00

CHEAT_SYS 0.52 0.65 0.43 1.00

13 UNN_2 0.52 0.48 0.76 0.47

14 ECNU_1 0.41 0.44 0.50 0.55

15 UNED_3 0.40 0.38 0.66 0.39

16 PRIYAVEN 0.39 0.37 0.61 0.38

ONE_IN_ONE 0.34 0.27 1.00 0.24

17 BUAP_1 0.33 0.27 0.89 0.25

Page 48: Web People Search

II. iv. The WePS-2 Campaign

System | F0.5 | Improved systems (UIR > 0.25) | Reference system | UIR for the reference system

(S1) PolyUHK 0.82 S2 S4 S6 S7 S8 S11 … S17 B1 - -

(S2) ITC-UT_1 0.81 S4 S6 S7 S8 S11 … S17 B1 S1 0.26

(S3) UVA_1 0.81 S2 S4 S7 S8 S11 … S17 B1 - -

(S4) XMEDIA_3 0.72 S11 S13 … S17 S1 0.58

(S5) UCI_2 0.71 S12 … S16 - -

(S6) UMD_4 0.70 S4 S7 S11 S13 … S17 B1 S1 0.35

(S7) FICO_3 0.70 S11 S13 … S17 S2 0.65

(S8) LANZHOU_1 0.70 S11 … S17 S1 0.74

(S9) UGUELPH_1 0.63 S4 S12 S14 S16 - -

(S10) CASIANED_4 0.63 S12 … S16 - -

(S11) AUG_4 0.57 S14 … S17 S3 0.68

(S12) UPM-SINT_4 0.56 S14 S16 S1 0.71

(B100) ALL_IN_ONE 0.53 Bcheat - -

(S13) UNN_2 0.52 S15 S16 S1 0.90

(Bcheat) CHEAT_SYS 0.52 - B100 0.65

(S14) ECNU_1 0.41 - S1 0.90

(S15) UNED_3 0.40 S16 S1 0.97

(S16) PRIYAVEN 0.39 - S1 1.00

(B1) ONE_IN_ONE 0.34 S17 S1 0.29

(S17) BUAP_1 0.33 - S6 0.84

Results of UIR on the WePS-2 dataset

Page 49: Web People Search

Run | Features | Feature weighting | Similarity | Clustering

PolyUHK | Local sentences, full text BoW, URL tokens, title tokens in root page, unigrams and bigrams, snippet-based features | TFIDF | Cosine similarity | HAC

UVA_1 | Stemmed words (Porter stemmer, standard stopword list) | Modified TFIDF | Cosine similarity | HAC

ITC-UT_1 | NEs, compound keywords, link features | - | Overlap coefficient | Two-stage HAC

UMD_4 | Tokens, NEs, variations of the ambiguous name, hyperlinks | - | Jaro-Winkler, Jaccard | HAC

XMEDIA_3 | Local unigrams and bigrams | Self information | Cosine similarity and learned similarity metrics | QT variant

UCI_2 | NEs, web overlap statistics for person and organization | TFIDF | Cosine similarity, Skyline classifier for web-based features | Two-stage clustering

LANZHOU_1 | NEs, email, phone, date, occupation | TFIDF | Cosine similarity | HAC

FICO_3 | NEs, URL tokens, page title tokens, NE lists, name match, gender | - | Heuristic based on matching and non-matching features | Greedy agglomeration within a block

UGUELPH_1 | Full text BoW | Modified TFIDF | - | Chameleon clustering

CASIANED_4 | NEs, tokens, URL tokens, snippet | TFIDF for tokens, special weighting for NEs | Cosine similarity | Classify pages according to the person's profession

AUG_4 | Place/date of birth/death, NEs, IP address, geographic location coordinates, weighted keywords, URL, email address, telephone, fax | Gain ratio | Cosine similarity | Fuzzy ants clustering, Agnes (hierarchical clustering)

UPM-SINT_4 | Full text BoW | - | Word overlap | -

ECNU_1 | Stemmed words selected with a χ² measure | Term frequency | Cosine similarity | K-means

PRIYAVEN | Full text BoW | TFIDF | Weighted Jaccard | Fuzzy ants clustering

UNED_3 | Relevant terms extracted with language model techniques | Kullback-Leibler divergence | Language models and cosine similarity | Heuristic

BUAP_1 | NEs | Term frequency | - | -

49

Page 50: Web People Search

WePS-2 summary

Consolidation of the WePS community: 17 research teams took part in the WePS-2 clustering task.

WePS-2 now provides benchmarking datasets and standardized evaluation metrics for the clustering and attribute extraction subtasks.

Now we can empirically answer questions such as:

How good are manual query refinements in WePS ?

What is the role of Named Entities in this task ?

II. iv. The WePS-2 Campaign

Page 51: Web People Search

Web People Search

51

I. Introduction.

II. Benchmarking.

I. The WePS-1 Campaign.

II. Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints.

III. Combining Evaluation Metrics via the Unanimous Improvement Ratio.

IV. The WePS-2 Campaign.

III. Empirical Studies.

I. The Scope of Query Refinement in the WePS Task.

II. The Role of Named Entities in WePS.

Contributions.

Page 52: Web People Search

Query Refinements in WePS

52

How good are manual query refinements ?

Are they a feasible people search strategy ?

III. i. The Scope of Query Refinement in the WePS Task

Page 53: Web People Search

Query Refinements in WePS

53

Tokens, bigrams, trigrams…
• football, publications, research
• curriculum vitae, full professor

Named Entities (person, location, organization)
• John Smith, Mary Jones
• Kansas City, Sunderland
• University of Sunderland

Manually extracted attributes (occupation, affiliation, email…)
• Occupation: Full professor
• born in 1940
• born in London

Trying to find John Tait, the researcher, in the John Tait document collection from WePS:

John Tait + refinement
Precision: 6/8
Recall: 6/6
Coverage: 1

[Diagram: retrieved documents split into relevant and non-relevant.]

III. i. The Scope of Query Refinement in the WePS Task

Page 54: Web People Search

Query Refinements in WePS

54

We will simulate query refinements for the people in the WePS testbed.

The best query refinements will be obtained from the documents and applied to refine the corresponding name document set.
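A minimal sketch of how a simulated refinement can be scored against the gold clustering for one person; matching a refinement by plain string containment is an assumption of this sketch, not necessarily the exact procedure used in the thesis:

def refinement_precision_recall(refinement, pages, relevant):
    # pages: {url: text} for one ambiguous name; relevant: gold urls of the target person
    retrieved = {url for url, text in pages.items() if refinement.lower() in text.lower()}
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Coverage is aggregated over test cases: the fraction of people for whom a refinement
# of the given type (token, attribute, named entity, ...) is available at all.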

III. i. The Scope of Query Refinement in the WePS Task

Page 55: Web People Search

III. i. The Scope of Query Refinement in the WePS Task

Results for popular people (clusters of size >= 3)

for each test case we select the best…

F α=0.5 precision recall coverage

token 0.87 0.90 0.86 1.00

bigram 0.79 0.95 0.70 1.00

trigram 0.75 0.96 0.65 1.00
…

Best n-gram 0.89 0.95 0.85 1.00

affiliation 0.51 0.96 0.39 0.81

occupation 0.52 0.93 0.40 0.80

email 0.35 0.96 0.23 0.33
…

Best manual attribute 0.60 0.97 0.47 0.92

location 0.62 0.87 0.53 1.00

organization 0.67 0.96 0.56 1.00

person 0.59 0.95 0.47 1.00

Best named entity 0.74 0.95 0.63 1.00

Best 0.89 0.96 0.85 1.00

55

Very good results when using all refinement types…

… but tokens and word n-grams alone already achieve the highest results.

There is usually at least one QR that leads to the desired set of results…

… but not necessarily an intuitive choice

Lower coverage in the manually extracted QRs

Manually tagged attributes: very precise, but they are not always present

Page 56: Web People Search

Scope of Query Refinements: Summary

There is not a single type of refinement that leads to optimal results, but a combination of diverse types.

Search results clustering might indeed be of practical help to users searching for people on the Web.

III. i. The Scope of Query Refinement in the WePS Task

Page 57: Web People Search

Web People Search

57

I. Introduction.

II. Benchmarking.

I. The WePS-1 Campaign.

II. Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints.

III. Combining Evaluation Metrics via the Unanimous Improvement Ratio.

IV. The WePS-2 Campaign.

III. Empirical Studies.

I. The Scope of Query Refinement in the WePS Task.

II. The Role of Named Entities in WePS.

Contributions.

Page 58: Web People Search

How effective are Named Entities compared to other features for document representation in WePS ?

Document representation: NEs vs other approaches

58

[Diagram: document collection → document representation (term weight vectors) → similarity → clustering.]

III. ii. The Role of Named Entities in WePS

Page 59: Web People Search

Reformulating the WePS task

59

Single features:

Classification task over coreferent document pairs.

Not dependent on the clustering algorithm.

WePS-1 and WePS-2 corpora: 293,000 document pairs.

Similarity between pairs computed using each feature.

Results evaluated with Precision and Recall.

III. ii. The Role of Named Entities in WePS

Page 60: Web People Search

Token based features

60

Tokens provide the best overall performance

III. ii. The Role of Named Entities in WePS

Page 61: Web People Search

Word n-gram based features

61

n-grams: more precise than single tokens at the cost of recall.

III. ii. The Role of Named Entities in WePS

Page 62: Web People Search

Named entities: Stanford NE tagger

62

Taken individually, NEs do not improve over tokens.

III. ii. The Role of Named Entities in WePS

Page 63: Web People Search

Reformulating the WePS task

63

Combination of features:

WePS-1 and WePS-2 corpora: 293,000 document pairs.

Similarity computed using feature combinations.

Results evaluated with machine learning and an upper boundary.

III. ii. The Role of Named Entities in WePS

Page 64: Web People Search

Combining similarity criteria

64

PWA measures the classification accuracy of one similarity criterion x:

PWA(x) = Prob( Sim_x(D_A, D_A') > Sim_x(D_B, D_C) )

where (D_A, D_A') is a coreferent document pair and (D_B, D_C) is a non-coreferent pair.

MaxPWA estimates the upper boundary accuracy of a set of similarity criteria X = <x_1, …, x_n>:

MaxPWA(X) = Prob( ∃ x_i ∈ X . Sim_{x_i}(D_A, D_A') > Sim_{x_i}(D_B, D_C) )
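A minimal sketch of how PWA and MaxPWA could be estimated by sampling pairs; the sampling scheme and the function signatures are illustrative assumptions:

import random

def estimate_pwa(sim_fns, coref_pairs, non_coref_pairs, samples=10000, seed=0):
    # sim_fns: {criterion name: function(doc_a, doc_b) -> similarity}
    rng = random.Random(seed)
    pwa = {name: 0 for name in sim_fns}
    max_pwa = 0
    for _ in range(samples):
        a = rng.choice(coref_pairs)        # a pair of pages about the same person
        b = rng.choice(non_coref_pairs)    # a pair of pages about different people
        wins = [name for name, sim in sim_fns.items() if sim(*a) > sim(*b)]
        for name in wins:
            pwa[name] += 1
        max_pwa += bool(wins)              # at least one criterion ranked the coreferent pair higher
    return {name: c / samples for name, c in pwa.items()}, max_pwa / samples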

III. ii. The Role of Named Entities in WePS

Page 65: Web People Search

Combining similarity criteria

65

Decision Tree and MaxPWA results are consistent.

Adding new features to tokens improves the classification.

NEs do not offer a competitive advantage when compared to non-linguistic features.

III. ii. The Role of Named Entities in WePS

[Chart comparing: Tokens; Tokens + n-grams; All features (including NEs).]

Page 66: Web People Search

Is our setting competitive with state-of-the-art systems?

Document representation: NEs vs other approaches

66

[Diagram: document collection → document representation (term weight vectors) → similarity → clustering.]

III. ii. The Role of Named Entities in WePS

Page 67: Web People Search

Results on the clustering task

67

The output of the Decision Tree classifier was used as the similarity metric.

These similarities were fed into a Hierarchical Agglomerative Clustering algorithm (sketched below).

A distance threshold was trained using WePS-1 data.
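A rough sketch of this setup, assuming scikit-learn; the feature matrices, the pair indexing and the threshold are illustrative placeholders rather than the exact thesis configuration:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import AgglomerativeClustering

def cluster_with_learned_similarity(train_X, train_y, test_X, test_pairs, n_docs, threshold=0.8):
    # train_X/train_y: pairwise feature vectors and coreference labels (e.g. WePS-1 pairs)
    # test_X/test_pairs: feature vectors and (i, j) document indices for one test name
    tree = DecisionTreeClassifier(max_depth=5).fit(train_X, train_y)
    proba = tree.predict_proba(test_X)[:, 1]          # P(coreferent) for each pair
    distance = np.ones((n_docs, n_docs))
    np.fill_diagonal(distance, 0.0)
    for (i, j), p in zip(test_pairs, proba):
        distance[i, j] = distance[j, i] = 1.0 - p
    hac = AgglomerativeClustering(n_clusters=None, distance_threshold=threshold,
                                  metric="precomputed", linkage="single")
    return hac.fit_predict(distance)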

Comparable to the best participant in WePS-2.

Adding NEs does not improve results.

III. ii. The Role of Named Entities in WePS

Page 68: Web People Search

Document representation: NEs vs other approaches

68

Individual features

Feature combinations

Validation on the clustering task

III. ii. The Role of Named Entities in WePS

Page 69: Web People Search

Named entities do not seem to provide a competitive advantage in the clustering process when compared to a combination of simpler features (tokens, n-grams, etc.).

This is not a prescription against the use of NEs:

They can be appropriate for presentation purposes.

Other approaches might be able to improve results using NE information.

Role of Named Entities Summary

III. ii. The Role of Named Entities in WePS

Page 70: Web People Search

Web People Search

70

I. Introduction.

II. Benchmarking.

I. The WePS-1 Campaign.

II. Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints.

III. Combining Evaluation Metrics via the Unanimous Improvement Ratio.

IV. The WePS-2 Campaign.

III. Empirical Studies.

I. The Scope of Query Refinement in the WePS Task.

II. The Role of Named Entities in WePS.

Contributions.

Page 71: Web People Search

Insights:

Study and characterization of the available evaluation metrics.

Extension of BCubed for overlapping clustering tasks.

Development of a metrics combination method (Unanimous Improvement Ratio) that does not depend on metric weighting.

Query refinements are effective but very diverse and unfeasible as a WePS strategy.

Named Entities do not seem to provide a competitive advantage over simpler features.

Products:

Development of a reference testbed for the task.

Currently more than 80 citations to the WePS-1 task description paper.

Several papers use WePS data as the de facto standard for the task.

A document clustering evaluation package.

An annotation GUI for document grouping tasks.

Contributions

IV. Contributions

Page 72: Web People Search

Further directions

Exploration of new approaches to the representation of documents (Wikipedia, Google n-gram corpus).

Application of the evaluation methods developed for WePS to other domains and tasks.

Future WePS evaluation campaigns.

Search for organizations.

Multilingual search (documents in different languages referring to the same person).

A new task integrating the clustering and attribute extraction problems.

IV. Contributions

Page 73: Web People Search

73

Thank you !

Page 74: Web People Search

Previous work

Related NLP tasks:

Cross Document Coreference.

Word Sense Disambiguation.

Word Sense Induction.

Test collections:

Until 2006, mostly newswire collections; Web collections are predominant now.

Manual annotation, but also pseudo-ambiguity generation.

In most cases, created ad hoc for a particular piece of research.

Disambiguation methods:

Hierarchical Agglomerative Clustering (HAC) is the most frequently employed method.

I. Introduction

Page 75: Web People Search

Unanimous Improvement Ratio

75

UIR Reflects the range of improvement

UIR = 0.03 UIR = 0.45 UIR = 0.77

Page 76: Web People Search

Web People Search vs. other NLP tasks

Cross-document Coreference: tries to link mentions of the same entities in a collection of texts.

Web People Search: groups documents that contain a mention of the same individual.

Example (Doc. 1): "[…] Captain John Smith (c. January 1580 – June 21, 1631) […]"
Example (Doc. 2): "[…] John Smith was an English adventurer […]"

76

Page 77: Web People Search

Web People Search vs. other NLP tasks

Word Sense Disambiguation: can rely on dictionaries to define the number of "senses" of an ambiguous term; common words disambiguation.

Web People Search: the number of senses is not known a priori; person name disambiguation; Web pages, open domain.

Word Sense Induction: common words disambiguation.

Citation disambiguation: person name disambiguation, but handles very structured information in a closed domain (scientific literature).

77

Page 78: Web People Search

WePS-1 Training and test collections

78

Page 79: Web People Search

Attributes

Occupation, affiliation & work are the most common attributes.

Most attributes appear in less than 1/10 of the documents.

79

Page 80: Web People Search

Attribute

Extraction Results

Difficult task!

80

Page 81: Web People Search

Scores per attribute

81

Page 82: Web People Search

Different Attributes, Different Results

Four types of attributes based on their characteristics:

Attribute | Description | Performance (recall) | Comments

Phone, FAX, email, Website | There is a typical pattern | R: 74-40 (ECNU, UvA) | Disambiguation is needed.

Degree, Nationality | Unfamiliar NE, but candidates are limited | R: 43-42 (CASIANED) | We need a good NE tagger for the category. Maybe possible.

Date of birth, Birth place, Other name, Affiliation, School, Mentor, Relative | Typical NE, disambiguation is needed | R: 55-17 (MIVTU, UvA, PolyUHK) | NE tagger is ready. We need good disambiguation.

Award, Major, Occupation | Unfamiliar and difficult NE type | R: 17-38 (UvA) | We need a good NE tagger for the category. It looks very difficult.

82

Page 83: Web People Search

Typical System Strategy

Most systems use a two-phase strategy:

1. Find the candidates
• Use an NE tagger, gazetteer or regular expressions to find candidates of the same type as the target attribute.

2. Filter (verify) the candidates
• Select only those which are attribute values of the target person. This can be done with local patterns, supervised classification, distance and cue phrases.
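A toy sketch of this two-phase strategy for one pattern-based attribute (email), using a naive proximity filter; the regular expression and the window size are illustrative, not taken from any participant system:

import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def extract_emails(text, person_name, window=300):
    # phase 1: regex candidates; phase 2: keep those near a mention of the target name
    candidates = [(m.group(), m.start()) for m in EMAIL.finditer(text)]
    mentions = [m.start() for m in re.finditer(re.escape(person_name), text, re.IGNORECASE)]
    return [email for email, pos in candidates
            if any(abs(pos - m) <= window for m in mentions)]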

83

Page 84: Web People Search

The classification accuracy of one similarity criterion x is:

PWA(x) = Prob( Sim_x(D_A, D_A') > Sim_x(D_B, D_C) )

We want to learn the relative weight of feature classes (e.g. person names vs. tokens):

Evaluation by a machine learning algorithm (e.g. Decision Tree).

Upper bound of any algorithm: when combining similarity criteria, at least one of them should identify the coreferent document pair:

MaxPWA(<x_1, …, x_n>) = Prob( ∃ x_i ∈ X . Sim_{x_i}(D_A, D_A') > Sim_{x_i}(D_B, D_C) )

Combining similarity criteria: PWA and MaxPWA (upper boundary)

84

Page 85: Web People Search

WePS-1 summary

We have built a manual testbed corpus for the development and evaluation of WePS systems: 47 person names, almost 4,700 documents, double annotation of the test data.

We have done a systematic evaluation and comparison of WePS systems: 29 teams expressed their interest in the task; 16 teams submitted results within the deadline.

Variability across test cases is large and unpredictable.

Testbed creation is more difficult and expensive than expected.

Purity measures can be cheated! Are purity and inverse purity the best options available among clustering metrics?

The combination of metrics has a strong effect on how we measure the contribution of systems (baselines are an extreme case). How does the combination of metrics affect the systems ranking?

87

Page 86: Web People Search

Contributions

A study of the actual need for name disambiguation systems.

In most cases there is an optimal query refinement for an individual…

… but this refinement is unlikely to be known in advance.

A date, a related person name, a place, the title of a book?

Results support the interest raised in the scientific community and the Web search business.

88

Page 87: Web People Search

Contributions

Development of reference test collections.

We have carried two dedicated evaluation campaigns: WePS-1 and WePS-2.

The problem has been standardised as a search results mining task (clustering and IE).

Creation of standard benchmarks for the WePS task.

Around 8,000 manually annotated web documents.

Including biographical features in WePS-2.

Manual annotation for WePS has shown to be a difficult process.

Lack of context in some documents, uncertainty even when information is available, high ambiguity, etc.

Too much information (genealogies)… or too little (public profiles from social networks).

Importance of training assessors and reaching a consensus on when to cluster two documents.

89

Page 88: Web People Search

Contributions

Development of improved clustering evaluation metrics.

We have defined four constraints for clustering quality metrics.

We have tested these constraints against different families of clustering metrics.

Only BCubed satisfies all constraints.

An additional constraint was defined to account for overlapping clustering.

BCubed has been extended for overlapping clustering and successfully applied to WePS results.

The Unanimous Improvement Ratio measure has been proposed.

Complements Precision and Recall weighting functions (F-measure)

Indicates the robustness of improvements across different α values of F.

90

Page 89: Web People Search

Contributions

The relevance of the Clustering Stopping Criterion

Ambiguity of person names is very variable.

From one to more than 70 in the top 100 search results.

This variability represents a challenge for clustering systems:

A baseline system can achieve higher scores than the best team by using the best similarity threshold for each topic.

Training a stopping criterion with a baseline approach achieves poor results...

… but a competitive result is achieved if we also train the relative weight of document similarity metrics.

In terms of evaluation, the trade-off between precision and recall metrics is determined by the stopping criterion.

This leads to a high variability of rankings depending on the evaluation metrics combination.

UIR provides complementary information in this context.

91

Page 90: Web People Search

Contributions

Study of the role of Named Entities and other features in the WePS task.

In the clustering process, NEs are not necessarily more useful thanfeatures such as word n-grams.

In our experiments, linguistic information (NEs, noun phrases, etc.) did not provide better results than computationally cheap featuressuch as tokens and word n-grams.

More sophisticated ways of using this type of information might yieldbetter results.

92

Page 91: Web People Search

Contributions

A large testbed for the WePS task.

Manually annotated collections for WePS-1 and WePS-2 campaigns.

Also available pre-processed, annotated with NLP tools and indexed with Lucene.

An annotation GUI has been developed to ease the manual grouping of web documents.

It can be reused for annotation on other disambiguation problems.

An evaluation package.

Includes standard clustering evaluation metrics.

Implements BCubed metrics and the Unanimous Improvement Ratio measure.

93

Page 92: Web People Search

Further directions

Exploration of new approaches to the representation of documents (Wikipedia, Google n-gram corpus).

Application of the evaluation methods developed for WePS to other domains and tasks.

Future WePS evaluation campaigns.

Search for organisations.

Multilingual search (documents in different languages referring to the same person).

A new task integrating the clustering and attribute extraction problems.

94