GOVERNMENT USERS Conference
“Navigating the Human Terrain”
College Park, MD, May 20-21, 2008

Linguistic Considerations of Identity Resolution
David Murgatroyd
Software Architect, Basis Technology


2

Outline

- Introduction
- Linguistic Challenges
  - Variation (Intentional & Unintentional)
  - Composition
  - Frequency
  - Under-specification
  - Multilinguality
- Integration Challenges
  - Inputs & Outputs
  - Properties
- Evaluation Challenges
  - Corpora: Find or Build?
  - Metrics: Adopt or Create?
- Conclusion

3

Introduction: An Exercise

Jim Killeen Kileen, J. D.

Jaime Kilin

جمس كلين

Is there a >50% chance these refer to the same person? If… US citizens; on a ferry to Spain; in a documentary

4

What is Identity Resolution?

Identity Resolution (aka Entity Resolution): determining whether two or more given references refer to the same entity.

Different from name matching: it is about the identity of entities, not the similarity of names.

See also: Murgatroyd, D. (2008). Some Linguistic Considerations of Entity Resolution and Retrieval. In Proceedings of the LREC 2008 Workshop on Resources and Evaluation for Identity Matching, Entity Resolution and Entity Management.

5

What sorts of references?

Non-linguistic reference examples:
- Numerical identifiers
  - SSN
  - Some portions of an address (street number, ZIP code)
- Visual identifiers (e.g., pictures, symbols)
- Biometrics (e.g., DNA, iris, signature, voice)

Linguistic reference examples:
- Nouns or pronouns in documents (e.g., “the CEO of Basis”)
- Names of associated/related entities
  - Locations (e.g., street or city name)
  - Organizations
  - Individuals
- Name of the entity <- we’re going to focus on this one

6

Let’s focus on names of people

- Common and familiar
- Often a fairly identifying piece of personal information
- Demonstrate typical challenges of resolution with linguistic data

7

Outline

- Introduction
- Linguistic Challenges
  - Variation (Intentional & Unintentional)
  - Composition
  - Frequency
  - Under-specification
  - Multilinguality
- Integration Challenges
  - Inputs & Outputs
  - Properties
- Evaluation Challenges
  - Corpora: Find or Build?
  - Metrics: Adopt or Create?
- Conclusion

8

Variation (Intentional)

Variation may be intentional. References may draw on a large set of names, varying by:
- Formality (e.g., nicknames)
- Transparency (e.g., aliases)
- Location (e.g., toponyms)
- Life status
  - Vocation (e.g., titles)
  - Marital status (e.g., marriage/divorce/widowhood)
  - Parenthood (e.g., patronymics)
  - Faith (e.g., christening, pilgrimage)
  - Death (e.g., posthumous names)
- Dialect (e.g., adolescent girls preferring “Jenni” over “Jenny”)
- Style of text (e.g., “Sollermun” for “Solomon” in Huck Finn)

Jim Killeen

9

Variation (Unintentional)

Variation may be unintentional, arising from:
- Typos
  - E.g., “Killeen” vs. “Kileen”
- Guessing spelling based on pronunciation
  - E.g., “Caliin”
- Ambiguities inherent in the encoding (e.g., Unicode):
  - Characters with the same glyph
    - E.g., Latin and Cyrillic small “i”
  - Characters with similar glyphs
    - E.g., Latin “K” and Greenlandic “ĸ”
  - Characters with composed/combined forms
    - E.g., ņ (precomposed n with cedilla) vs. ņ (n + combining cedilla)

Kileen, J. D.
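The composed/combined ambiguity is purely mechanical and can be neutralized before matching. A minimal sketch in Python (my illustration, not from the talk) using Unicode NFC normalization; the homoglyph check shows why normalization alone is not enough:

```python
import unicodedata

def canonical(name: str) -> str:
    """Fold composed vs. combining character forms (e.g., n + combining
    cedilla vs. precomposed U+0146) into one canonical form."""
    return unicodedata.normalize("NFC", name)

# Composed vs. combining cedilla: visually identical, different code points.
composed = "\u0146"        # ņ (LATIN SMALL LETTER N WITH CEDILLA)
combining = "n\u0327"      # n + COMBINING CEDILLA
assert composed != combining
assert canonical(composed) == canonical(combining)

# Same-glyph ambiguity survives NFC: Latin "i" vs. Cyrillic "і" (U+0456)
# remain distinct, so homoglyphs need a separate mapping step.
assert canonical("i") != canonical("\u0456")
```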

10

Composition

Names have differing orders:
- Given vs. surname: “Killeen, Jim” vs. “Jim Killeen”
- Order varies by culture

Name references may be partial: “Jim” vs. “Jim Killeen”

11

Under-specification

Name components may be abbreviated:
- Initials (e.g., “J. D.”)
- Abbreviations (e.g., “Jas.”)

Name references may have incomplete…
- orthography (e.g., Semitic languages)
- segmentation (e.g., Asian languages)
- phonology (e.g., ideographic languages)

Kileen, J. D.

جمس كلين

12

Frequency

- Any person can make up a name (an open class)
- A few are common; most are very uncommon (a Zipfian distribution)
- Lessons:
  - Valuable to know common names
  - Valuable to have a strategy for unknown names
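To see why both lessons hold, a toy illustration (mine, not from the talk): under a Zipfian model over a hypothetical vocabulary of 10,000 distinct names, the top 10 names alone cover roughly 30% of all references, yet a long tail of rare names still accounts for a large share:

```python
# Toy Zipfian model: the name of rank r occurs with weight 1/r.
# Shows that a handful of common names covers a large share of references,
# while most references still use rare names (hence both lessons above).
N = 10_000                                 # hypothetical count of distinct names
weights = [1 / r for r in range(1, N + 1)]
total = sum(weights)

top10_share = sum(weights[:10]) / total
tail_share = sum(weights[1000:]) / total   # names ranked 1001..10000

print(f"top 10 names:    {top10_share:.1%} of references")
print(f"names past 1000: {tail_share:.1%} of references")
```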

13

Multilinguality

Names may appear in many languages-of-use. This leads to variation at many linguistic levels.

Orthographic: transliteration confronts skew in:
- orthographic-to-phonetic mappings of the source and target languages-of-use
- sound systems between the languages

James Klein <-> جمس كلين
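One coarse way around orthographic skew (my illustration, not a technique from the talk) is to compare consonant skeletons, since Arabic script typically omits short vowels and Latin spellings of vowels vary. The letter map below is a tiny hypothetical romanization table, just enough for this one example:

```python
# Hypothetical, minimal Arabic-to-Latin letter map for this example only.
AR2LAT = {"ج": "j", "م": "m", "س": "s", "ك": "k", "ل": "l", "ي": "y", "ن": "n"}
VOWELS = set("aeiouy ")

def skeleton(text: str) -> str:
    """Romanize (where the map applies) and keep only consonants,
    a crude way to sidestep vowel skew between scripts."""
    romanized = "".join(AR2LAT.get(ch, ch) for ch in text.lower())
    return "".join(ch for ch in romanized if ch not in VOWELS)

assert skeleton("James Klein") == skeleton("جمس كلين") == "jmskln"
```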

14

Multilinguality (cont’d)

Syntactic: different languages-of-use may imply different name word order.

Semantic: name words which communicate meaning (e.g., titles) may vary (e.g., “Jr.” for “الأصغر”, which means “the younger”).

Pragmatic: different languages-of-use may use different names based on the audience (e.g., “Mr. Laden” vs. “الأمير”, which means “the prince”).

15

Outline

- Introduction
- Linguistic Challenges
  - Variation (Intentional & Unintentional)
  - Composition
  - Frequency
  - Under-specification
  - Multilinguality
- Integration Challenges
  - Inputs & Outputs
  - Properties
- Evaluation Challenges
  - Corpora: Find or Build?
  - Metrics: Adopt or Create?
- Conclusion

16

Inputs & Outputs

Input options include:
- Pair-wise: simple integration, but no shared effort
- Set-based: harder integration, but able to optimize

Output options include:
- Feature-based: with weights/tuning
- Probability-based:
  - more principled combination
  - NOTE: similarity is not probability
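The two input shapes can be sketched as function signatures (hypothetical names, not an API from the talk); the probability-based output is meant to be a calibrated value in [0, 1], not a raw similarity score:

```python
# Pair-wise interface: simplest to integrate; each call is independent,
# so no work can be shared across comparisons.
def resolve_pair(a: str, b: str) -> float:
    """Return a (calibrated) probability that a and b co-refer.
    Placeholder logic for the sketch: exact match only."""
    return 1.0 if a == b else 0.0

# Set-based interface: sees all references at once, so it can share work
# (e.g., blocking/indexing) and optimize globally across the set.
def resolve_set(refs: list[str], threshold: float = 0.5) -> list[set[str]]:
    clusters: list[set[str]] = []
    for r in refs:
        for c in clusters:
            if any(resolve_pair(r, m) > threshold for m in c):
                c.add(r)
                break
        else:
            clusters.append({r})
    return clusters

print(resolve_set(["Jim Killeen", "Jaime Kilin", "Jim Killeen"]))
```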

17

Integration Properties

Certain properties help make implementations efficient:

Reflexivity:
- Resolve(a,a) is always true
- NOTE: does not imply Resolve(a,a’) where a~a’

Commutativity:
- Resolve(a,b) <=> Resolve(b,a)

Transitivity:
- Resolve(a,b) & Resolve(b,c) => Resolve(a,c)
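If one is willing to assume (or impose, via transitive closure) commutativity and transitivity, clusters can be maintained cheaply with a union-find structure; a sketch of that design choice, not anything prescribed by the talk:

```python
# Union-find over references: exploits commutativity (links are undirected)
# and transitivity (merging) for near-linear clustering of resolve() hits.
class UnionFind:
    def __init__(self):
        self.parent: dict[str, str] = {}

    def find(self, x: str) -> str:
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: str, b: str) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

uf = UnionFind()
uf.union("Jim Killeen", "Kileen, J. D.")   # pairwise resolve() said yes
uf.union("Kileen, J. D.", "Jaime Kilin")
# Transitive closure follows for free:
assert uf.find("Jim Killeen") == uf.find("Jaime Kilin")
```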

18

Outline

- Introduction
- Linguistic Challenges
  - Variation (Intentional & Unintentional)
  - Composition
  - Frequency
  - Under-specification
  - Multilinguality
- Integration Challenges
  - Inputs & Outputs
  - Properties
- Evaluation Challenges
  - Corpora: Find or Build?
  - Metrics: Adopt or Create?
- Conclusion

19

Corpora: Find or Build?

Requirements:
- Annotated for ground truth
- Represents linguistic challenges
- Scalable/practical

Options:
- Adapt public “database” corpora:
  - Wikipedia: Annotated: yes; Representative: somewhat; Scalable: yes
  - Citation DBs: Annotated: no; Representative: somewhat; Scalable: yes

20

Corpora: Find or Build? (cont’d)

- Adapt public “document” corpora:
  - Co-reference documents: Annotated: yes; Representative: less, as often a single doc/language-of-use; Scalable: yes
- Create corpora by hand:
  - From scratch, via “parrot sessions” (auditory or visual): Annotated: yes; Representative: largely; Scalable: no
  - From un-annotated databases: Annotated: no; Representative: yes; Scalable/practical: no, as databases may be private
  - Synthesize from a generative model: Annotated: yes; Representative: no, tied to the generating model; Scalable: yes

21

Metrics

Back to our initial example:

Jim Killeen     Kileen, J. D.     Jaime Kilin     جمس كلين

[Diagram: the four references clustered three ways: by the Reference (ground truth), by System A, and by System B.]

22

Metrics: Adopt or Create?

How to quantify the quality of the system’s resolutions vs. the reference?

Goals:
- Discriminative: separates good vs. bad systems for users’ needs
- Interpretable: number aligns with intuition

Considerations:
- Assume transitive closure (TC) of output?
- Apply weights to try to be more discriminative?

Common concepts:
- Precision: % of stuff in the answer that’s right
- Recall: % of the right stuff that’s in the answer
- F-Score: harmonic mean of these = 2*P*R/(P+R)
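For pair-wise evaluation, the “stuff” is co-referring pairs of references; a small sketch (my own, with made-up link sets, not data from the talk):

```python
# Pairwise precision/recall/F over co-reference links.
# gold: pairs that truly co-refer; system: pairs the system linked.
def pairwise_prf(gold: set, system: set) -> tuple:
    correct = len(gold & system)
    p = correct / len(system) if system else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = {frozenset(pair) for pair in [("A", "B"), ("C", "D")]}
system = {frozenset(pair) for pair in [("A", "B"), ("A", "C")]}
p, r, f = pairwise_prf(gold, system)   # one of two system links is right
```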

23

Candidate Metrics

- Pair-wise % correct: over all N*(N-1)/2 node pairs
- Pair-wise P&R: based on links drawn
- Edit-distance: # of links to add/subtract to reach the correct graph
- Metrics used in document co-reference resolution:
  - MUC-6: entity-based P&R on missing links from the graph
  - B-CUBED: average per-reference P&R of links
  - CEAF (Constrained Entity-Alignment F): entities aligned using some similarity measure; P&R are the % of possible similarity achieved
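Of these, B-CUBED is easy to state in code. A sketch (mine) following Bagga & Baldwin (1998): for each reference, precision and recall compare its system cluster against its gold cluster, then average over all references:

```python
# B-CUBED: per-reference precision/recall, averaged over all references.
def b_cubed(gold: list, system: list) -> tuple:
    gold_of = {m: c for c in gold for m in c}    # mention -> gold cluster
    sys_of = {m: c for c in system for m in c}   # mention -> system cluster
    mentions = list(gold_of)
    p = sum(len(sys_of[m] & gold_of[m]) / len(sys_of[m]) for m in mentions)
    r = sum(len(sys_of[m] & gold_of[m]) / len(gold_of[m]) for m in mentions)
    return p / len(mentions), r / len(mentions)

gold = [{"Jim Killeen", "Kileen, J. D."}, {"Jaime Kilin"}]
system = [{"Jim Killeen"}, {"Kileen, J. D.", "Jaime Kilin"}]
p, r = b_cubed(gold, system)
```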

24

Comparing Metrics

Jim Killeen     Kileen, J. D.     Jaime Kilin     جمس كلين

[Table: scores for Systems A and B on the example under each candidate metric: pair-wise % correct and pair-wise F (each with and without transitive closure), edit-distance, and MUC-6, B-CUBED, and CEAF (all with TC). “My preference” is marked.]

25

Conclusion

- Identity resolution systems face linguistic challenges
- They need to be carefully integrated to meet these challenges
- Evaluation corpora should reflect these challenges
- Evaluation metrics should align with qualitative judgements

26

Bibliography

Bagga, A., Baldwin, B. (1998). Algorithms for scoring coreference chains. In Proceedings of the First International Conference on Language Resources and Evaluation Workshop on Linguistic Coreference.

Fellegi, I. P., Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, Vol. 64, No. 328, pp. 1183--1210.

Luo, X. (2005). On coreference resolution performance metrics. In Proc. of HLT-EMNLP, pp 25--32.

Menestrina, D., Benjelloun, O., Garcia-Molina, H. (2006). Generic entity resolution with data confidences. In First International VLDB Workshop on Clean Databases. Seoul, Korea.

Murgatroyd, D. (2008). Some Linguistic Considerations of Entity Resolution and Retrieval. In Proceedings of the LREC 2008 Workshop on Resources and Evaluation for Identity Matching, Entity Resolution and Entity Management.

Spock Team (2008). The Spock Challenge. http://challenge.spock.com/ (Retrieved February 5.)

Vilain, M., Burger, J., Aberdeen, J., Connolly, D., Hirschman, L. (1995). A model-theoretic coreference scoring scheme. In Proceedings of the 6th Message Understanding Conference (MUC-6). Morgan Kaufmann, pp. 45--52.

27

Questions?

More information: http://www.basistech.com