Identity uncertainty and citation matching. Pasula, H., Marthi, B., Milch, B., Russell, S., Shpitser, I. NIPS 2002

Identity uncertainty and citation matching

Pasula, H., Marthi, B., Milch, B., Russell, S., Shpitser, I.

NIPS 2002

Some relevant news…

Facebook rolled out Graph Search on Tuesday. The natural-language search engine  can, for example, search for "music that people who like Mitt Romney like," or "photos of my friends in 1989." Although it's limited to four topics at the moment - people, places, interests and photos - the queries that users can run against the millions of photos and connections stored among Facebook's billion users are powerful, powerful tools. … Bing - like Google - has begun to try and provide answers to questions, rather than lists of links. But that's the same territory Zuckerberg and Co. have staked out, too.

Outline

  • Background of research
  • Key contributions
  • Citation matching and information extraction
  • Identity uncertainty
  • Generative model
  • Experimental results
  • Implications for information extraction more generally

Background of research

  • Record linkage (Fellegi & Sunter 1969):
      - Naïve Bayes model for record-pair match/mismatch vector given entity match/mismatch
      - Trained on matched and unmatched pairs
      - Sensitive to population sizes in train/test
  • Bayesian analysis of identity:
      - Data association literature (multitarget tracking)
      - Huang and Russell 97 (freeway surveillance)
  • Previous work on RPMs (Koller and Pfeffer)
  • Previous work on MCMC for RPM++ (Pasula & Russell 01)
  • CiteSeer not working too well
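
The record-linkage model in the first bullet can be illustrated with a minimal sketch. Under the Naïve Bayes assumption, each field of the comparison vector contributes an independent log-likelihood ratio. The field names and the m/u probabilities below are made-up illustrative values, not figures from the talk.

```python
import math

# Hypothetical per-field probabilities (illustrative values only):
# m = P(field agrees | records refer to the same entity)
# u = P(field agrees | records refer to different entities)
M_PROBS = {"surname": 0.95, "title": 0.90, "year": 0.85}
U_PROBS = {"surname": 0.01, "title": 0.001, "year": 0.05}

def match_weight(agreement):
    """Sum of per-field log-likelihood ratios for one record pair.

    `agreement` maps each field name to True (fields agree) or False.
    """
    total = 0.0
    for field, agrees in agreement.items():
        m, u = M_PROBS[field], U_PROBS[field]
        if agrees:
            total += math.log(m / u)
        else:
            total += math.log((1 - m) / (1 - u))
    return total

def match_posterior(agreement, prior_match):
    """Posterior P(match | agreement vector), computed on the log-odds scale."""
    log_odds = math.log(prior_match / (1 - prior_match)) + match_weight(agreement)
    return 1 / (1 + math.exp(-log_odds))

pair = {"surname": True, "title": True, "year": False}
# Same evidence, different match priors: the posterior moves a lot.
p_small_population = match_posterior(pair, prior_match=1e-2)
p_large_population = match_posterior(pair, prior_match=1e-6)
```

The two posteriors differ only through the prior match rate, which is one way to read the "sensitive to population sizes" bullet: a prior fitted on one population does not transfer to a much larger one.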

CiteSeer02: Russell w/4 Norvig

Russell S, Norvig P (1995) Artificial Intelligence: A Modern Approach, Prentice Hall Series in Artificial Intelligence. Englewood Cliffs, New Jersey

Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, 1995.

Russell S.; Norvig, P. Articial Intelligence - A Modern Approach. Prentice-Hall International Editions, 1995.

Russell S.J., Norvig P., (1995) Artificial Intelligence, A Modern Approach. Prentice Hall.

S. Russell and P. Norvig. Articial Intelligence, a Modern Approach. Prentice Hall, New Jersey, NJ, 1995.

Stuart Russell and Peter Norvig. Artificial intelligence: A modern approach. Prentice-Hall Series on Artificial Intelligence. Prentice-Hall, Englewood Cliffs, New Jersey, 1995.

S. Russell and P Norvig. Artifical Intelligence: a Modern Approach. Prentice Hall, 1995. Book Details from Amazon or Barnes \& Noble

Stuart Russell and Peter Norvig. Articial Intelligence: A Modern Approach. Prentice Hall, 1995.

S. J. Russell and P. Norvig. Artificial Intelligence, a modern approach. Prentice Hall, Upper Saddle River, New Jersey 07458, 1995.

Stuart Russell and Peter Norvig. Artificial Intelligence. A modern approach. Prentice-Hall, 1995.

S. J. Russell and P. Norvig. Articial Intelligence: A Modern Approach. Prentice Hall. 1995.

S. Russell and P. Norvig, Artificial Intelligence A Modern Approach Prentice Hall 1995.

S. Russell and P. Norvig. Introduction to Artificial Intelligence. Prentice Hall, 1995.

Stuart Russell and Peter Norvig. Artficial Intelligence: A Modern Approach. Prentice-Hall, Saddle River, NJ, 1995.

Stuart Russell and Peter Norvig. Articial Intelligence a modern approach. Prentice Hall series in articial intelligence. Prentice Hall, Upper Saddle River, New Jersey, 1995.

Chapter 18 Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig, Prentice-Hall, 2000.

Dynamics of computational ecosystems. Physical Review A 40:404--421. Russell, S., and Norvig, P. 1995. Artificial Intelligence: A Modern Approach. Prentice Hall.

S. Russell, P. Norvig: Artificial Intelligence -- A Modern Approach, Prentice Hall, 1995.

Russell, S. \& Norvig, P. (1995) Artificial Intelligence: A Modern Appraoch (Englewood Cliffs, NJ: Prentice-Hall). Book Details from Amazon or Barnes \& Noble

Stuart Russell and Peter Norvig. AI: A Modern Approach. Prentice Hall, NJ, 1995.

S. Russell, P. Norvig. Artificial Intelligence: A Modem Approach. Prentice- Hall, Inc., 1995.

391-414. Russell SJ, Norvig P (

Russell and Peter Norvig, "Artificial Intelligence - A Modern Approach (AIMA)", pp. 33

Stuart Russell and Peter Norvig: Artificial Intelligence: A Modern Approach, Prentice-Hall, 1994.

Russell, S. \& Norvig, P., An Introduction to Artificial Intelligence: A Modern Approach, Prentice Hall International, 1996.

S. Russell, P. Norvig. Artician Intelligence. A modern approach. Prentice Hall, 1995.

Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995. Contributing writers: John F. Canny, Jitendra M. Malik, Douglas D. Edwards. ISBN 0-13-103805-2.

Stuart Russell and Peter Norvig. Artificial Intelligence: A Mordern Approach. Prentice Hall, Englewood Cliffs, New Jersey 07632, 1995.

In Proceedings of the Third Annual Conference on Evolutionary Programming (pp. 131--139). River Edge, NJ: World Scientific. Russell, S.J., \& Norvig, P. 1995. Artificial Intelligence, A Modern Approach. Englewood Cliffs, NJ: Prentice Hall.

John Wiley. Russell, S., \& Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice-Hall, Inc.

Stuart Russell and Peter Norvig: Artifcial Intelligence A Modern Approach, Englewood Clioes, NJ: Prentice Hall, 1995.

In Scherer, K.R. \& Ekman, P. Approaches to Emotion, 13--38. Hillsdale, NJ: Lawrence Erlbaum. Russell, S.J. and Norvig, P. 1995. Artificial Intelligent: A Modern Approach. Englewood Cliffs, NJ: Prentice Hall.

Rosales E, Forthcoming Masters dissertation, Department of Computer Science, University of Essex, Colchester UK Russell S and Norvig P (1995) Artificial Intelligence: A Modern Approach. Prentice Hall: Englewood Cliffs, New Jersey.

S. Russell and P. Norvig (1995) Artificial Intelligence; A Modern Approach, Prentice Hall, New Jersey.

S. Russell and P. Norvig. Articial Intelligence. A Modern Approach. Prentice-Hall, 1995. ISBN 0-13-360124-2.

Stuart J. Russell and Peter Norvig. Articial Intelligence: A Modern Approach, chapter 17. Number 0-13-103805-2 in Series in Articial Intelligence. Prentice Hall, 1995.

Stuart J. Russell and Peter Norvig. Articial Intelligence A Modern Approach. Prentice Hall, Englewood Cli s, New Jersey, USA, 1995. 32

Morgan Kaufmann Publishers. Russell, S., and Norvig, P. 1995. Artificial Intelligence: A Modern Approach. Prentice Hall.

Stuart J. Russell and Peter Norvig. Articial Intelligence: AModern Approach,chapter 17. Number 0-13-103805-2 in Series in Articial Intelligence. Prentice Hall, 1995.

W. Shavlik and T. G. Dietterich, eds., Morgan Kaufmann, San Mateo, CA. Russell, S. and Norvig, P. (1995). Artificial Intelligence - A Morden Approach. Englewood Cliffs, NJ: Prentice-Hall.

KeyGraph: Automatic indexing by co-occurrence graph based on building construction metaphor. In Advanced Digital Library Conference. to appear. Russell, S., and Norvig, P. 1995. Artificial Intelligence --A Modern Approach--. Prentice-Hall.

Formal derivation of rule-based programs. IEEE Transactions on Software Engineering 19(3):277--296. Russell, S., and Norvig, P. 1995. Artificial Intelligence: A Modern Approach. Prentice Hall.

Russell, Stuart and Peter Norvig, Artificial Intelligence, A Modern Approach, New Jersey, Prentice Hall, 1995.

S. Russell, P. Norvig: Articial Intelligence: A modern approach; Prentice Hall (1995).

Rechenberg, I. (89). Artificial evolution and artificial intelligence. In Forsyth, R. (Ed.), Machine Learning, pp. 83--103 London. Chapman. Russell, S., \& Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice Hall.

Russell, S and Norvig, P. 1995. Articial Intelligence: A Modern Approach Prentice-Hall, Englewood Cli s, New Jersey, 1995.

Russell, S., \& Norvig, P. (1995) . Artificial intelligence: A modern monitoring methods for information retrieval systems: From search approach. Prentice-Hall series on artificial intelligence. Upper Saddle product to search process. Journal of the American Society for Information Science, 47, 568 -- 583. River, NJ: Prentice-Hall.

Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach, chapter 17. Number 0-13-103805-2 in Series in Artificial Intelligence. Prentice Hall, 1995.

S. Russell and P. Norvig. Articial Intelligence A Modern Approach. Prentice Hall, Englewood Cli s, 1995.

Russell, Stuart and Norvig, Peter: Artificial Intelligence: A Modern Approach, Prentice Hall, Englewood Cliffs NJ, 1995

S. Russell and P. Norvig. ????????? ????????????? ? ?????? ????????. Prentice Hall, Englewood Cli s, NJ, 1995.

S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach - The Intelligent Agent Book, Prentice Hall, NY, 1995.

S. Russell and P. Norvig. Artificial Intelligence-aModern Approach. Prentice Hall International, Englewood Cliffs, NJ,USA,1995.

S.J.Russell, P.Norvig: Arti cial intelligence. A modern approach", Prentice-Hall International, 1995.

In Proceedings of the Third Annual Conference on Evolutionary Programming (pp. 131--139). River Edge, NJ: World Scientific. Russell, S.J., \& Norvig, P. 1995. Artificial Intelligence, A Modern Approach. Englewood Cliffs, NJ: Prentice Hall.

In Working Notes of the IJCAI-95 Workshop on Entertainment and AI/ALife, 19--24. Russell, S., and Norvig, P. 1995. Artificial Intelligence: A Modern Approach. Prentice Hall.

Stuart J. Russell and Peter Norvig. Artiilcial Intelligence: A Modern Approach. Prentice Hall, Englewood Cliffs, N J, 1995.

Academic Press. 359--380. Russell, S., and Norvig, P. 1994. Artificial Intelligence: A Modern Approach. Prentice Hall.

Stuart J. Russell, Peter Norvig, Artifical Intelligence: A Modern Appraoch, Prentice-Hall, Englewood Cliffs, New Jersey. 1994.

Cambridge, MA: MIT Press. Russell, S. J., and Norvig, P. (1994). Artificial Intelligence: A Modern Approach. Englewood Cliffs, NJ: Prentice-Hall.

Morgan Kauffman. Russell, S., and Norvig, P. 1994. Artificial Intelligence: A Modern Approach. Prentice Hall.

Fast Plan Generation Through Heuristic Search Russell, S., \& Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ.

Hoffmann \& Nebel Russell, S., \& Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ.

Stuart Russel and Peter Norvig. Artificial Intelligence: A Modern Approach, chapter 12.1 - 12.3, pages 367--380. Prentice Hall, 1995.

Stuart Russel and Peter Norvig. Artificial Intelligence, A Modern Approach. PrenticeHall, 1996. 2

Stuart Russel, Peter Norvig, Articial Intelligence: A Modern Approach, Prentice Hall, New Jersey, US, 1995

Russel, S., and Norvig, P. Articial Intelligence. A Modern Approach. Prentice Hall Series in Artificial Intelligence. 1995.

S. Russel and P. Norvig. Artificial Intelligence, A Modern Approach, Prentice Hall: 1995. Book Details from Amazon or Barnes \& Noble

S. J. Russel and P. Norvig. Articial Intelligence A Modern Approach, chapter 14, pages 426-435. Prentice Hall Series in Articial Intelligence. Prentice Hall International, Inc., London, UK, rst edition, 1995. Exercise 14.3.

Russel, S. and P. Norvig. Articial intelligence: A modern approach, Prentice Hall, 1995. Book Details from Amazon or Barnes \& Noble

S. Russel and P. Norvig Artificial Intelligence: A Modern Approach, MIT Press 1995.

Russel, S. and Norvig, P., "Artificial Intelligence: A Modern Approch," p. 111-114, Prentice-Hall.

J. Russel and P. Norvig. Artificial Intelligence, A Modern Approach. Prentice Hall, Upper Saddle River, NJ, 1995. 71

Stuart Russel and Peter Norvig. A Modern, Agent-Oriented Approach to Introductory Artificial Intelligence. 1995.

Stuart J. Russel and Peter Norvig. Artificial Intelligence---A Modern Approach, chapter 14, pages 426--435. Prentice Hall Series in Artificial Intelligence. Prentice Hall Internationall, Inc., London, UK, first edition, 1995. Excersice 14.3.

Russel S. and Norvig P. (1995). Articial Intelligence. A Modern Approach. Prentice Hall Series in Artificial Intelligence.

S. Russel, P. Norvig Articial Intelligence - A Modern Approach Prentice Hall, 1995

Russel, S., P. Norvig. Artificial Intelligence: A Modern Approach Prentice Hall 1995.

Artificial Intelligence, S Russel \& P Norvig, Prentice Hall, 1995 21

Russel, S.J, Norvig P: Artificial Intelligence. A Modern Approach, Prentice Hall Inc. 1995

Russel, S., Norvig, P. (1995) Artificial Intellience - A modern approach. (Englewood Cliffs: Prentice Hall International).

Key contributions

  • A “vertically integrated,” declarative, generative model from objects, relations, and interpretations all the way down to text strings
  • Clear distinction between entities and strings
  • Probability of identity calculated as a direct consequence of the model (no “similarity heuristics”)
  • Semantically driven “collective disambiguation” during parsing occurs automatically
  • MCMC over possible worlds (object existence and relational structures), scaled linearly with #citations

Citation information extraction

Given a set of citation strings scraped from online papers, determine:

  • What distinct papers exist (including offline papers)
  • What authors (venues, publishers, etc.) exist
  • What the titles and authors (etc.) of the papers are
  • Which paper cites which paper

Example

[Lashkari et al 94] Collaborative Interface Agents, Yezdi Lashkari, Max Metral, and Pattie Maes, Proceedings of the Twelfth National Conference on Articial Intelligence, MIT Press, Cambridge, MA, 1994.

Metral M. Lashkari, Y. and P. Maes. Collaborative interface agents. In Conference of the American Association for Artificial Intelligence, Seattle, WA, August 1994.

Are these descriptions of the same object? Who are the authors? What is the venue?

Core task in CiteSeer, Google Scholar, over 300 record linkage companies

CiteSeer (2001) asserted that Russell and Norvig wrote over 100 distinct books

Relational probability models

  • Taxonomic hierarchy of object classes (fixed)
  • Named instances in each class (fixed)
  • Complex attributes A denote typed relations between objects (e.g., AuthorOf(author, paper))
      - Paper allows for nonfunctional relations and uncertainty over the (bounded) number #[A] of values for A for each object
  • Simple attributes B denote typed fixed-range functions
  • Probability models P(B | Parents(B)) where Parents(B) are attribute chains A1.….An.B’
      - E.g., height ~ mother.height, father.height

RPM for citations

Priors for object attributes

  • Names: 0.9 × (~ US census) + 0.1 × letter bigram
  • Titles: word + letter bigrams from AI BibTeX database
  • #authors, #fnames, pubType: counts in hand-parsed training set of 500 citations
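
As a rough illustration of the name prior, the sketch below mixes a census lookup with a fallback letter model using the 0.9/0.1 weights from the slide. The census table is a toy stand-in, and the fallback uses uniform letter transitions instead of bigram counts estimated from data; both are assumptions made to keep the example short.

```python
# Toy stand-in for census name frequencies (hypothetical values, not real data).
CENSUS_FREQ = {"smith": 0.6, "russell": 0.3, "norvig": 0.1}
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def letter_model_prob(name):
    """Placeholder for the letter-bigram model.

    Every next symbol (26 letters plus an end-of-name symbol) is equally
    likely, so the probabilities still sum to 1 over all letter strings.
    A real bigram model would condition each letter on the previous one.
    """
    per_step = 1.0 / (len(ALPHABET) + 1)
    return per_step ** (len(name) + 1)  # one factor per letter, one for the stop

def name_prior(name, census_weight=0.9):
    """Mixture prior: census_weight x census + (1 - census_weight) x letter model."""
    name = name.lower()
    census = CENSUS_FREQ.get(name, 0.0)
    return census_weight * census + (1 - census_weight) * letter_model_prob(name)

known = name_prior("Russell")    # dominated by the census component
unknown = name_prior("Russel")   # not in the toy table: letter model only
```

The point of the small letter-model weight is that a name missing from the census table, such as a misspelling, still gets nonzero probability instead of being ruled out.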

Observation models

  • Corruption of names:
      - Letter deletion/insertion/change
      - First name(s) replaced by initials
      - Omission/reordering of authors
  • Corruption of titles:
      - Letter and word deletion/insertion/replacement
  • Parameters learned online by EM without requiring ground truth
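
A minimal generative sketch of the name-corruption part of the observation model follows. The edit probabilities are arbitrary assumptions. Author omission/reordering, title corruption, and EM learning of the parameters are all left out.

```python
import random

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def corrupt_letters(text, rng, p_edit=0.05):
    """Apply an independent deletion, insertion, or change at each letter."""
    out = []
    for ch in text:
        if rng.random() < p_edit:
            op = rng.choice(["delete", "insert", "change"])
            if op == "delete":
                continue                         # drop this letter
            if op == "insert":
                out.append(ch)                   # keep it, then add a random letter
                out.append(rng.choice(LETTERS))
            else:
                out.append(rng.choice(LETTERS))  # replace it with a random letter
        else:
            out.append(ch)
    return "".join(out)

def observe_author(first, last, rng, p_initial=0.5, p_edit=0.05):
    """Generate an observed author string from the true first and last names."""
    if rng.random() < p_initial:
        first = first[0] + "."                   # first name replaced by its initial
    else:
        first = corrupt_letters(first, rng, p_edit)
    return first + " " + corrupt_letters(last, rng, p_edit)

rng = random.Random(0)
samples = [observe_author("Stuart", "Russell", rng) for _ in range(5)]
```

Inference runs this process in the other direction: given an observed string such as "S. Russel", it asks which true names could plausibly have produced it.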

Models for Citation.parse and Citation.text

  • Range of parse attribute has two parts:
      - Style specifies order between title and authors plus how to write the authors (J. Smith vs Smith, J.). Prior estimated from hand-segmented training set.
      - Segmentation (tiny grammar) specifies cut points: <filler1>|<title>|<filler2>|<authors>|<filler3>
          [A50] “|I, Robot|.” By |I. Asimov|. Gnome Press, 1950.
      - Prior for cut points is uniform
      - Filler: letter + word bigrams estimated from training set
      - Filler 3 depends on PubType
  • Style, segmentation, obsTitle, names of obsAuthors, and fillers jointly determine text
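
The segmentation grammar can be sketched by enumerating every placement of the four cut points for one style (title before authors). Cutting at token boundaries is an assumption made here for brevity; the slides do not say whether cuts fall between characters or tokens.

```python
def segmentations(tokens):
    """Yield every split of `tokens` into filler1|title|filler2|authors|filler3.

    Cut points satisfy 0 <= a < b <= c < d <= len(tokens), so the title and
    authors segments are non-empty while any filler may be empty.
    """
    n = len(tokens)
    for a in range(n):
        for b in range(a + 1, n + 1):
            for c in range(b, n):
                for d in range(c + 1, n + 1):
                    yield {
                        "filler1": tokens[:a],
                        "title": tokens[a:b],
                        "filler2": tokens[b:c],
                        "authors": tokens[c:d],
                        "filler3": tokens[d:],
                    }

citation = "[A50] I, Robot. By I. Asimov. Gnome Press, 1950.".split()
candidates = list(segmentations(citation))
uniform_prior = 1.0 / len(candidates)  # uniform prior over cut-point placements
```

With a uniform prior over cut points, the choice among candidates is driven entirely by how well the title, author, and filler models explain each segment.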

Comment

The model contains no “features” or other heuristics related to determining if two citations match each other

Identity uncertainty

  • We name papers P1, P2, … cited by C1, C2, …
  • Uncertainty as to whether P1 = P2 etc.
      - If P1 = P2, just one title, author list; otherwise two
  • Possible world includes an assignment ι that specifies equivalence classes of papers: {{P1},{P2}} vs. {{P1,P2}}
  • Probability model factors as P(ι) P(world | ι)
      - P(ι) = Π_C P(ι_C); “known” classes (e.g., citations) are fixed sets of singletons (e.g., {{C1},…,{Ck}})

Identity uncertainty contd.

  • Specify prior on number of objects |ι_C|
      - E.g., number of authors ~ lognormal[6.9, 2.3]
  • Make appropriate a priori uniformity assumption
      - Each paper equally likely to be cited
      - Each author equally likely to write a paper
  • Suppose ι_{C,k,m} maps k named instances to m distinct objects in C out of n total; then P(ι_{C,k,m}) = n! / ((n-m)! n^k)
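
The assignment prior can be checked numerically. The sketch below evaluates n! / ((n-m)! n^k) for one specific assignment and confirms that the values sum to 1 over all ways of grouping k names into equivalence classes. The helper names and the choice of three names and ten objects are illustrative assumptions.

```python
from math import factorial

def assignment_prior(k, m, n):
    """Probability of one specific assignment of k named instances to
    m distinct objects out of n total: n! / ((n - m)! * n**k)."""
    if m > n or m > k:
        return 0.0
    return factorial(n) / (factorial(n - m) * n ** k)

def set_partitions(items):
    """Yield every partition of `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or give it a block of its own.
        yield [[first]] + partition

names = ["P1", "P2", "P3"]
n_objects = 10
total = sum(assignment_prior(len(names), len(p), n_objects)
            for p in set_partitions(names))
```

For two names the formula gives P({{P1,P2}}) = 1/n and P({{P1},{P2}}) = 1 - 1/n, so the prior probability that two names co-refer falls as the number of objects grows.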

Identity uncertainty contd.

  • Why does the number of objects matter?
  • Suppose, for any object, we observe just one attribute: an integer in the range [1,…,1 000 000]
      - Prior distribution for values is uniform
      - Observed objects A and B both have value 526881
      - What is the probability that A=B?
      - If there are a thousand objects, 0.999
      - If there are a billion objects, 0.001
  • Probability of identity is affected by
      - Number of objects, size of measurement space
      - Measurement accuracy, similarity between objects
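
The 0.999 and 0.001 figures follow from Bayes' rule under two assumptions stated here for the sketch: A and B are each an observation of an object drawn uniformly from the N objects (so the prior probability of A=B is 1/N), and the attribute is observed without error.

```python
def prob_same_object(num_objects, num_values):
    """Posterior P(A = B | A and B show the same attribute value)."""
    prior_same = 1.0 / num_objects
    like_same = 1.0               # same object: values must agree
    like_diff = 1.0 / num_values  # different objects: agree only by chance
    numerator = prior_same * like_same
    return numerator / (numerator + (1.0 - prior_same) * like_diff)

p_thousand = prob_same_object(1_000, 1_000_000)
p_billion = prob_same_object(1_000_000_000, 1_000_000)
```

In closed form the posterior is M / (M + N - 1) for M possible values and N objects, so identity is likely only when the measurement space is large relative to the population.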

Inference

Given a fixed assignment ι and a fixed “relational skeleton” (complex attributes), the RPM can be grounded as a Bayes net

Inference contd.

  • Metropolis-Hastings MCMC
  • First part of proposal is a split/merge on ι_C:
      - Drop two clusters of citations (i.e., papers) Pa, Pb
      - Add two empty clusters P1, P2
      - Put citations from Pa, Pb uniformly at random (u.a.r.) into P1, P2
  • Second part fills in attribute values:
      - Names, title: apply perturbation model in reverse (and occasionally pick a nearby census name)
      - Parse: sample directly from precomputed list
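
The first part of the proposal can be sketched as below. This is a simplified illustration, not the authors' implementation: it omits the Metropolis-Hastings acceptance ratio and the attribute-filling second part, and it simply discards a new cluster that ends up empty (which makes the move a merge).

```python
import random

def split_merge_proposal(clusters, rng):
    """Propose a new clustering of citations into papers.

    `clusters` is a list of lists of citation ids; the input is not modified.
    """
    if len(clusters) < 2:
        return [list(c) for c in clusters]
    i, j = rng.sample(range(len(clusters)), 2)
    pooled = clusters[i] + clusters[j]    # drop clusters Pa and Pb
    new_a, new_b = [], []                 # add two empty clusters P1, P2
    for citation in pooled:               # reassign each citation u.a.r.
        (new_a if rng.random() < 0.5 else new_b).append(citation)
    rest = [list(c) for k, c in enumerate(clusters) if k not in (i, j)]
    return rest + [c for c in (new_a, new_b) if c]

current = [["C1", "C2"], ["C3"], ["C4", "C5"]]
proposal = split_merge_proposal(current, random.Random(0))
```

Because citations are redistributed at random, one move type covers merging (everything lands in one new cluster) and reshuffling (both new clusters are non-empty).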

MCMC inference

Scaling

  • Naïve M-H merge proposals fail more frequently for larger worlds
  • Canopies (McCallum et al KDD-00; see also blocking methods in data deduplication): sets of entities that have >ε chance of matching according to a simple distance metric (i.e., don’t try merging “Smith” and “Baeza-Yates”)
  • Pick a canopy first, then a pair of papers in it
  • Runtime scales
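
A small sketch of canopy construction follows, using token-set Jaccard similarity as the cheap metric and two thresholds in the spirit of the canopy method. The metric, the threshold values, and the function names are assumptions for illustration, not details from the talk.

```python
import re

def token_set(text):
    """Lower-cased alphanumeric tokens of a citation string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_canopies(citations, loose=0.2, tight=0.6):
    """Group citation indices into possibly overlapping canopies.

    Similarity >= `loose` to the center puts a citation in the canopy;
    similarity >= `tight` also stops it from becoming a later center.
    """
    sets = [token_set(c) for c in citations]
    remaining = list(range(len(citations)))
    canopies = []
    while remaining:
        center = remaining[0]
        canopy = [i for i in range(len(citations))
                  if i == center or jaccard(sets[center], sets[i]) >= loose]
        canopies.append(canopy)
        remaining = [i for i in remaining
                     if i != center and jaccard(sets[center], sets[i]) < tight]
    return canopies

citations = [
    "Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, 1995.",
    "S. Russell and P. Norvig, Artificial Intelligence A Modern Approach Prentice Hall 1995.",
    "Metral M. Lashkari, Y. and P. Maes. Collaborative interface agents. In Conference of the American Association for Artificial Intelligence, Seattle, WA, August 1994.",
]
canopies = build_canopies(citations)
```

Merge proposals are then drawn only from pairs that share a canopy, so the sampler avoids spending moves on pairs that the cheap metric already rules out.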

Citation Matching Results

Four data sets of ~300-500 citations, referring to ~150-300 papers

[Figure: bar chart of error (fraction of clusters not recovered correctly; axis from 0 to 0.25) on the Reinforce, Face, Reason, and Constraint data sets, comparing Phrase Matching [Lawrence et al. 1999], Generative Model + MCMC [Pasula et al. 2002], and Conditional Random Field [Wellner et al. 2004].]
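
One plausible reading of the error measure, "fraction of clusters not recovered correctly," is sketched below: a true cluster counts as recovered only if the predicted clustering contains exactly the same set of citations. The talk does not spell out the definition, so this interpretation is an assumption.

```python
def cluster_error(true_clusters, predicted_clusters):
    """Fraction of true clusters with no exactly matching predicted cluster."""
    predicted = {frozenset(c) for c in predicted_clusters}
    missed = sum(1 for c in true_clusters if frozenset(c) not in predicted)
    return missed / len(true_clusters)

truth = [["C1", "C2", "C3"], ["C4"], ["C5", "C6"]]
guess = [["C1", "C2"], ["C3"], ["C4"], ["C5", "C6"]]
error = cluster_error(truth, guess)  # one of the three true clusters is split
```

Under this definition a single misplaced citation makes its whole cluster count as an error, which is stricter than a pairwise measure.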

Cross-Citation Disambiguation

Wauchope, K. Eucalyptus: Integrating Natural Language Input with a Graphical User Interface. NRL Report NRL/FR/5510-94-9711 (1994).

Is "Eucalyptus" part of the title, or is the author named K. Eucalyptus Wauchope?

Kenneth Wauchope (1994). Eucalyptus: Integrating natural language input with a graphical user interface. NRL Report NRL/FR/5510-94-9711, Naval Research Laboratory, Washington, DC, 39pp.

Second citation makes it clear how to parse the first one

Cross-citation disambiguation

Later experiments (Milch) showed:

  • 33% of singleton citations are parsed correctly
  • 64% of non-singleton citations are parsed correctly

What about Russell and Norvig?

Still generates several clusters:

  • Correct version
  • Introduction to Artificial Intelligence
  • S. Russel and P. Norvig, Artificial Intelligence
  • Etc.

Copying of erroneous citations produces data whose best explanation has multiple clusters

Could add a copying model, plus times of citing and cited papers

Summary

A somewhat nontrivial application of a declarative, relational, open-universe language

Relatively simple model for existence – did not allow for dependence on other objects

Inference results suggest scalability is not the huge barrier for model-based methods

Later work by Pasula raised accuracy to ~99% (comparable to humans); errors were mainly on partial or concatenated citation strings

Building the model was initially difficult