Identity uncertainty and citation matching
Pasula, H., Marthi, B., Milch, B., Russell, S., Shpitser, I.
NIPS 2002
Some relevant news…
Facebook rolled out Graph Search on Tuesday. The natural-language search engine can, for example, search for "music that people who like Mitt Romney like," or "photos of my friends in 1989." Although it's limited to four topics at the moment - people, places, interests and photos - the queries that users can run against the millions of photos and connections stored among Facebook's billion users are powerful, powerful tools. … Bing - like Google - has begun to try and provide answers to questions, rather than lists of links. But that's the same territory Zuckerberg and Co. have staked out, too.
Outline
  Background of research
  Key contributions
  Citation matching and information extraction
  Identity uncertainty
  Generative model
  Experimental results
  Implications for information extraction more generally
Background of research
  Record linkage (Fellegi & Sunter 1969)
    Naïve Bayes model for record-pair match/mismatch vector given entity match/mismatch
    Trained on matched and unmatched pairs
    Sensitive to population sizes in train/test
  Bayesian analysis of identity
    Data association literature (multitarget tracking)
    Huang and Russell 97 (freeway surveillance)
  Previous work on RPMs (Koller and Pfeffer)
  Previous work on MCMC for RPM++ (Pasula & Russell 01)
  CiteSeer not working too well
CiteSeer02: Russell w/4 Norvig
Russell S, Norvig P (1995) Artificial Intelligence: A Modern Approach, Prentice Hall Series in Artificial Intelligence. Englewood Cliffs, New Jersey
Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, 1995.
Russell S.; Norvig, P. Articial Intelligence - A Modern Approach. Prentice-Hall International Editions, 1995.
Russell S.J., Norvig P., (1995) Artificial Intelligence, A Modern Approach. Prentice Hall.
S. Russell and P. Norvig. Articial Intelligence, a Modern Approach. Prentice Hall, New Jersey, NJ, 1995.
Stuart Russell and Peter Norvig. Artificial intelligence: A modern approach. Prentice-Hall Series on Artificial Intelligence. Prentice-Hall, Englewood Cliffs, New Jersey, 1995.
S. Russell and P Norvig. Artifical Intelligence: a Modern Approach. Prentice Hall, 1995. Book Details from Amazon or Barnes \& Noble
Stuart Russell and Peter Norvig. Articial Intelligence: A Modern Approach. Prentice Hall, 1995.
S. J. Russell and P. Norvig. Artificial Intelligence, a modern approach. Prentice Hall, Upper Saddle River, New Jersey 07458, 1995.
Stuart Russell and Peter Norvig. Artificial Intelligence. A modern approach. Prentice-Hall, 1995.
S. J. Russell and P. Norvig. Articial Intelligence: A Modern Approach. Prentice Hall. 1995.
S. Russell and P. Norvig, Artificial Intelligence A Modern Approach Prentice Hall 1995.
S. Russell and P. Norvig. Introduction to Artificial Intelligence. Prentice Hall, 1995.
Stuart Russell and Peter Norvig. Artficial Intelligence: A Modern Approach. Prentice-Hall, Saddle River, NJ, 1995.
Stuart Russell and Peter Norvig. Articial Intelligence a modern approach. Prentice Hall series in articial intelligence. Prentice Hall, Upper Saddle River, New Jersey, 1995.
Chapter 18 Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig, Prentice-Hall, 2000.
Dynamics of computational ecosystems. Physical Review A 40:404--421. Russell, S., and Norvig, P. 1995. Artificial Intelligence: A Modern Approach. Prentice Hall.
S. Russell, P. Norvig: Artificial Intelligence -- A Modern Approach, Prentice Hall, 1995.
Russell, S. \& Norvig, P. (1995) Artificial Intelligence: A Modern Appraoch (Englewood Cliffs, NJ: Prentice-Hall). Book Details from Amazon or Barnes \& Noble
Stuart Russell and Peter Norvig. AI: A Modern Approach. Prentice Hall, NJ, 1995.
S. Russell, P. Norvig. Artificial Intelligence: A Modem Approach. Prentice- Hall, Inc., 1995.
391-414. Russell SJ, Norvig P (
Russell and Peter Norvig, "Artificial Intelligence - A Modern Approach (AIMA)", pp. 33
Stuart Russell and Peter Norvig: Artificial Intelligence: A Modern Approach, Prentice-Hall, 1994.
Russell, S. \& Norvig, P., An Introduction to Artificial Intelligence: A Modern Approach, Prentice Hall International, 1996.
S. Russell, P. Norvig. Artician Intelligence. A modern approach. Prentice Hall, 1995.
Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995. Contributing writers: John F. Canny, Jitendra M. Malik, Douglas D. Edwards. ISBN 0-13-103805-2.
Stuart Russell and Peter Norvig. Artificial Intelligence: A Mordern Approach. Prentice Hall, Englewood Cliffs, New Jersey 07632, 1995.
In Proceedings of the Third Annual Conference on Evolutionary Programming (pp. 131--139). River Edge, NJ: World Scientific. Russell, S.J., \& Norvig, P. 1995. Artificial Intelligence, A Modern Approach. Englewood Cliffs, NJ: Prentice Hall.
John Wiley. Russell, S., \& Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice-Hall, Inc.
Stuart Russell and Peter Norvig: Artifcial Intelligence A Modern Approach, Englewood Clioes, NJ: Prentice Hall, 1995.
In Scherer, K.R. \& Ekman, P. Approaches to Emotion, 13--38. Hillsdale, NJ: Lawrence Erlbaum. Russell, S.J. and Norvig, P. 1995. Artificial Intelligent: A Modern Approach. Englewood Cliffs, NJ: Prentice Hall.
Rosales E, Forthcoming Masters dissertation, Department of Computer Science, University of Essex, Colchester UK Russell S and Norvig P (1995) Artificial Intelligence: A Modern Approach. Prentice Hall: Englewood Cliffs, New Jersey.
S. Russell and P. Norvig (1995) Artificial Intelligence; A Modern Approach, Prentice Hall, New Jersey.
S. Russell and P. Norvig. Articial Intelligence. A Modern Approach. Prentice-Hall, 1995. ISBN 0-13-360124-2.
Stuart J. Russell and Peter Norvig. Articial Intelligence: A Modern Approach, chapter 17. Number 0-13-103805-2 in Series in Articial Intelligence. Prentice Hall, 1995.
Stuart J. Russell and Peter Norvig. Articial Intelligence A Modern Approach. Prentice Hall, Englewood Cli s, New Jersey, USA, 1995. 32
Morgan Kaufmann Publishers. Russell, S., and Norvig, P. 1995. Artificial Intelligence: A Modern Approach. Prentice Hall.
Stuart J. Russell and Peter Norvig. Articial Intelligence: AModern Approach,chapter 17. Number 0-13-103805-2 in Series in Articial Intelligence. Prentice Hall, 1995.
W. Shavlik and T. G. Dietterich, eds., Morgan Kaufmann, San Mateo, CA. Russell, S. and Norvig, P. (1995). Artificial Intelligence - A Morden Approach. Englewood Cliffs, NJ: Prentice-Hall.
KeyGraph: Automatic indexing by co-occurrence graph based on building construction metaphor. In Advanced Digital Library Conference. to appear. Russell, S., and Norvig, P. 1995. Artificial Intelligence --A Modern Approach--. Prentice-Hall.
Formal derivation of rule-based programs. IEEE Transactions on Software Engineering 19(3):277--296. Russell, S., and Norvig, P. 1995. Artificial Intelligence: A Modern Approach. Prentice Hall.
Russell, Stuart and Peter Norvig, Artificial Intelligence, A Modern Approach, New Jersey, Prentice Hall, 1995.
S. Russell, P. Norvig: Articial Intelligence: A modern approach; Prentice Hall (1995).
Rechenberg, I. (89). Artificial evolution and artificial intelligence. In Forsyth, R. (Ed.), Machine Learning, pp. 83--103 London. Chapman. Russell, S., \& Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice Hall.
Russell, S and Norvig, P. 1995. Articial Intelligence: A Modern Approach Prentice-Hall, Englewood Cli s, New Jersey, 1995.
Russell, S., \& Norvig, P. (1995) . Artificial intelligence: A modern monitoring methods for information retrieval systems: From search approach. Prentice-Hall series on artificial intelligence. Upper Saddle product to search process. Journal
of the American Society for Information Science, 47, 568 -- 583. River, NJ: Prentice-Hall.
Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach, chapter 17. Number 0-13-103805-2 in Series in Artificial Intelligence. Prentice Hall, 1995.
S. Russell and P. Norvig. Articial Intelligence A Modern Approach. Prentice Hall, Englewood Cli s, 1995.
Russell, Stuart and Norvig, Peter: Artificial Intelligence: A Modern Approach, Prentice Hall, Englewood Cliffs NJ, 1995
S. Russell and P. Norvig. ????????? ????????????? ? ?????? ????????. Prentice Hall, Englewood Cli s, NJ, 1995.
S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach - The Intelligent Agent Book, Prentice Hall, NY, 1995.
S. Russell and P. Norvig. Artificial Intelligence-aModern Approach. Prentice Hall International, Englewood Cliffs, NJ,USA,1995.
S.J.Russell, P.Norvig: Arti cial intelligence. A modern approach", Prentice-Hall International, 1995.
In Proceedings of the Third Annual Conference on Evolutionary Programming (pp. 131--139). River Edge, NJ: World Scientific. Russell, S.J., \& Norvig, P. 1995. Artificial Intelligence, A Modern Approach. Englewood Cliffs, NJ: Prentice Hall.
In Working Notes of the IJCAI-95 Workshop on Entertainment and AI/ALife, 19--24. Russell, S., and Norvig, P. 1995. Artificial Intelligence: A Modern Approach. Prentice Hall.
Stuart J. Russell and Peter Norvig. Artiilcial Intelligence: A Modern Approach. Prentice Hall, Englewood Cliffs, N J, 1995.
Academic Press. 359--380. Russell, S., and Norvig, P. 1994. Artificial Intelligence: A Modern Approach. Prentice Hall.
Stuart J. Russell, Peter Norvig, Artifical Intelligence: A Modern Appraoch, Prentice-Hall, Englewood Cliffs, New Jersey. 1994.
Cambridge, MA: MIT Press. Russell, S. J., and Norvig, P. (1994). Artificial Intelligence: A Modern Approach. Englewood Cliffs, NJ: Prentice-Hall.
Morgan Kauffman. Russell, S., and Norvig, P. 1994. Artificial Intelligence: A Modern Approach. Prentice Hall.
Fast Plan Generation Through Heuristic Search Russell, S., \& Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ.
Hoffmann \& Nebel Russell, S., \& Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ.
Stuart Russel and Peter Norvig. Artificial Intelligence: A Modern Approach, chapter 12.1 - 12.3, pages 367--380. Prentice Hall, 1995.
Stuart Russel and Peter Norvig. Artificial Intelligence, A Modern Approach. PrenticeHall, 1996. 2
Stuart Russel, Peter Norvig, Articial Intelligence: A Modern Approach, Prentice Hall, New Jersey, US, 1995
Russel, S., and Norvig, P. Articial Intelligence. A Modern Approach. Prentice Hall Series in Artificial Intelligence. 1995.
S. Russel and P. Norvig. Artificial Intelligence, A Modern Approach, Prentice Hall: 1995. Book Details from Amazon or Barnes \& Noble
S. J. Russel and P. Norvig. Articial Intelligence A Modern Approach, chapter 14, pages 426-435. Prentice Hall Series in Articial Intelligence. Prentice Hall International, Inc., London, UK, rst edition, 1995. Exercise 14.3.
Russel, S. and P. Norvig. Articial intelligence: A modern approach, Prentice Hall, 1995. Book Details from Amazon or Barnes \& Noble
S. Russel and P. Norvig Artificial Intelligence: A Modern Approach, MIT Press 1995.
Russel, S. and Norvig, P., "Artificial Intelligence: A Modern Approch," p. 111-114, Prentice-Hall.
J. Russel and P. Norvig. Artificial Intelligence, A Modern Approach. Prentice Hall, Upper Saddle River, NJ, 1995. 71
Stuart Russel and Peter Norvig. A Modern, Agent-Oriented Approach to Introductory Artificial Intelligence. 1995.
Stuart J. Russel and Peter Norvig. Artificial Intelligence---A Modern Approach, chapter 14, pages 426--435. Prentice Hall Series in Artificial Intelligence. Prentice Hall Internationall, Inc., London, UK, first edition, 1995. Excersice 14.3.
Russel S. and Norvig P. (1995). Articial Intelligence. A Modern Approach. Prentice Hall Series in Artificial Intelligence.
S. Russel, P. Norvig Articial Intelligence - A Modern Approach Prentice Hall, 1995
Russel, S., P. Norvig. Artificial Intelligence: A Modern Approach Prentice Hall 1995.
Artificial Intelligence, S Russel \& P Norvig, Prentice Hall, 1995 21
Russel, S.J, Norvig P: Artificial Intelligence. A Modern Approach, Prentice Hall Inc. 1995
Russel, S., Norvig, P. (1995) Artificial Intellience - A modern approach. (Englewood Cliffs: Prentice Hall International).
Key contributions
  A “vertically integrated,” declarative, generative model from objects, relations, and interpretations all the way down to text strings
  Clear distinction between entities and strings
  Probability of identity calculated as a direct consequence of the model (no “similarity heuristics”)
  Semantically driven “collective disambiguation” during parsing occurs automatically
  MCMC over possible worlds (object existence and relational structures), scaled linearly with #citations
Citation information extraction
  Given a set of citation strings scraped from online papers
  Determine
    What distinct papers exist (including offline papers)
    What authors (venues, publishers, etc.) exist
    What the titles and authors (etc.) of the papers are
    Which paper cites which paper
Example
[Lashkari et al 94] Collaborative Interface Agents, Yezdi Lashkari, Max Metral, and Pattie Maes, Proceedings of the Twelfth National Conference on Articial Intelligence, MIT Press, Cambridge, MA, 1994.
Metral M. Lashkari, Y. and P. Maes. Collaborative interface agents. In Conference of the American Association for Artificial Intelligence, Seattle, WA, August 1994.
Are these descriptions of the same object?
Who are the authors?
What is the venue?
Core task in CiteSeer, Google Scholar, over 300 record linkage companies
CiteSeer (2001) asserted that Russell and Norvig wrote over 100 distinct books
Relational probability models
  Taxonomic hierarchy of object classes (fixed)
  Named instances in each class (fixed)
  Complex attributes A denote typed relations between objects (e.g., AuthorOf(author,paper))
    Paper allows for nonfunctional relations and uncertainty over the (bounded) number #[A] of values for A for each object
  Simple attributes B denote typed fixed-range functions
  Probability models P(B | Parents(B)) where Parents(B) are attribute chains A1.….An.B’
    E.g., height ~ mother.height, father.height
RPM for citations
Priors for object attributes
  Names: 0.9 x (~ US census) + 0.1 x letter bigram
  Titles: word + letter bigrams from AI BibTeX database
  #authors, #fnames, pubType: counts in hand-parsed training set of 500 citations
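The name prior above is a two-component mixture. A minimal sketch of that idea follows, assuming a tiny made-up census table and an add-one-smoothed letter-bigram model trained on the same toy names. `TOY_CENSUS`, `train_bigrams`, `bigram_prob`, and `name_prior` are hypothetical names for illustration, not the authors' code or data.

```python
from collections import Counter

# Hypothetical toy census frequencies; the real model uses US census name data.
TOY_CENSUS = {"smith": 0.6, "russell": 0.3, "norvig": 0.1}
START, END = "^", "$"
NUM_SYMBOLS = 27  # assumed alphabet: 26 letters plus the end marker


def train_bigrams(names):
    """Count letter bigrams (with start/end markers) over a collection of names."""
    counts = Counter()
    for name in names:
        s = START + name.lower() + END
        counts.update(zip(s, s[1:]))
    return counts


def bigram_prob(name, counts):
    """Probability of a name under an add-one-smoothed letter-bigram model."""
    s = START + name.lower() + END
    prob = 1.0
    for prev, cur in zip(s, s[1:]):
        context_total = sum(c for (a, _), c in counts.items() if a == prev)
        prob *= (counts[(prev, cur)] + 1) / (context_total + NUM_SYMBOLS)
    return prob


def name_prior(name, counts, w_census=0.9):
    """Mixture prior: 0.9 x census frequency + 0.1 x letter-bigram probability."""
    census = TOY_CENSUS.get(name.lower(), 0.0)
    return w_census * census + (1.0 - w_census) * bigram_prob(name, counts)
```

The bigram component is what lets a name that is absent from the census table (say, a misspelling) still receive nonzero prior probability.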
Observation models
  Corruption of names:
    Letter deletion/insertion/change
    First name(s) replaced by initials
    Omission/reordering of authors
  Corruption of titles:
    Letter and word deletion/insertion/replacement
  Parameters learned online by EM without requiring ground truth
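To make the corruption operations concrete, here is a small sampling sketch. It assumes independent per-letter edits with a single probability `p_edit`, which is a simplification chosen for illustration; the function names and the parameterization are hypothetical and the slides do not specify the actual model at this level.

```python
import random
import string


def to_initials(first_names: str) -> str:
    """Replace first name(s) by initials, e.g. 'Stuart Jonathan' -> 'S. J.'."""
    return " ".join(part[0].upper() + "." for part in first_names.split())


def corrupt_letters(text: str, p_edit: float, rng: random.Random) -> str:
    """With probability p_edit per letter, delete it, insert a letter after it, or change it."""
    out = []
    for ch in text:
        if rng.random() < p_edit:
            op = rng.choice(["delete", "insert", "change"])
            if op == "delete":
                continue  # drop this letter
            if op == "insert":
                out.append(ch)
                out.append(rng.choice(string.ascii_lowercase))
            else:  # change: substitute a random letter
                out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)
    return "".join(out)
```

For example, `corrupt_letters("Norvig", 0.1, random.Random(0))` draws one noisy rendering of the surname; in the model, inference runs this process in reverse to score how plausibly an observed string came from a given true name.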
Models for Citation.parse and Citation.text
Range of parse attribute has two parts:
  Style specifies order between title and authors plus how to write the authors (J. Smith vs Smith, J.). Prior estimated from hand-segmented training set.
  Segmentation (tiny grammar) specifies cut points: <filler1>|<title>|<filler2>|<authors>|<filler3>
    [A50] “|I, Robot|.” By |I. Asimov|. Gnome Press, 1950.
    Prior for cut points is uniform
    Filler: letter + word bigrams estimated from training set
    Filler 3 depends on PubType
Style, segmentation, obsTitle, names of obsAuthors, and fillers jointly determine text
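The segmentation part of the parse can be illustrated with a short sketch: four cut points divide a citation string into the five fields of the tiny grammar. The helper names `segment` and `cuts_from_marked` are hypothetical, and the example string uses straight quotes in place of the slide's curly quotes.

```python
FIELDS = ("filler1", "title", "filler2", "authors", "filler3")


def segment(text: str, cuts: tuple[int, int, int, int]) -> dict[str, str]:
    """Split a citation string at four non-decreasing cut points into the five fields."""
    if list(cuts) != sorted(cuts) or cuts[0] < 0 or cuts[-1] > len(text):
        raise ValueError("cut points must be non-decreasing and lie within the string")
    bounds = (0, *cuts, len(text))
    return {name: text[a:b] for name, a, b in zip(FIELDS, bounds, bounds[1:])}


def cuts_from_marked(marked: str) -> tuple[str, tuple[int, ...]]:
    """Turn a string with '|' markers into (plain text, cut-point offsets)."""
    parts = marked.split("|")
    cuts, pos = [], 0
    for part in parts[:-1]:
        pos += len(part)
        cuts.append(pos)
    return "".join(parts), tuple(cuts)


# The slide's example, with '|' marking the four cut points.
marked = '[A50] "|I, Robot|." By |I. Asimov|. Gnome Press, 1950.'
text, cuts = cuts_from_marked(marked)
fields = segment(text, cuts)
```

Here `fields["title"]` is `I, Robot` and `fields["authors"]` is `I. Asimov`. Under the uniform prior on cut points, every valid choice of `cuts` is equally likely a priori, so the title, author, and filler models do the work of preferring one segmentation over another.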
Comment
The model contains no “features” or other heuristics related to determining if two citations match each other
Identity uncertainty
  We name papers P1, P2, … cited by C1, C2, …
  Uncertainty as to whether P1 = P2 etc.
    If P1 = P2, just one title, author list; otherwise two
  Possible world includes an assignment ι that specifies equivalence classes of papers: {{P1},{P2}} vs. {{P1,P2}}
  Probability model factors as P(ι) P(world | ι)
    $P(\iota) = \prod_C P(\iota_C)$; “known” classes (e.g., citations) are fixed sets of singletons (e.g., {{C1},…,{Ck}})
Identity uncertainty contd.
  Specify prior on number of objects |ιC|
    E.g., number of authors ~ lognormal[6.9,2.3]
  Make appropriate a priori uniformity assumption
    Each paper equally likely to be cited
    Each author equally likely to write a paper
  Suppose $\iota_{C,k,m}$ maps k named instances to m distinct objects in C out of n total; then $P(\iota_{C,k,m}) = \frac{n!}{(n-m)!\, n^k}$
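The prior on an identity assignment can be checked numerically. The sketch below evaluates n! / ((n-m)! n^k), reading it as: each of the k named instances independently picks one of n objects uniformly, and the assignment groups them into m specific equivalence classes. The function name `identity_prior` is hypothetical.

```python
from math import perm


def identity_prior(n: int, k: int, m: int) -> float:
    """P(iota_{C,k,m}) = n! / ((n - m)! * n**k) for k names mapped to m of n objects."""
    if not (0 <= m <= k and m <= n):
        raise ValueError("need 0 <= m <= k and m <= n")
    # perm(n, m) = n! / (n - m)! counts ordered choices of m distinct objects.
    return perm(n, m) / n**k


# Two named papers, 1000 objects: co-reference vs. distinct referents.
p_same = identity_prior(1000, 2, 1)   # {{P1, P2}}
p_diff = identity_prior(1000, 2, 2)   # {{P1}, {P2}}
```

With k = 2 the two possible assignments have probabilities 1/n and (n-1)/n, which sum to one, so larger n makes co-reference less likely a priori.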
Identity uncertainty contd.
  Why does the number of objects matter?
    Suppose, for any object, we observe just one attribute: an integer in the range [1,…,1 000 000]
    Prior distribution for values is uniform
    Observed objects A and B both have value 526881
    What is the probability that A=B?
      If there are a thousand objects, 0.999
      If there are a billion objects, 0.001
  Probability of identity is affected by
    Number of objects, size of measurement space
    Measurement accuracy, similarity between objects
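The two numbers above follow from Bayes' rule. The sketch below is one reading that reproduces them, assuming A and B each refer to an object chosen uniformly and independently from the population and that the attribute is observed without noise. The function name `prob_same_object` is hypothetical.

```python
def prob_same_object(n_objects: int, n_values: int) -> float:
    """Posterior P(A = B | both observations show the same value)."""
    prior_same = 1.0 / n_objects          # A and B pick the same object
    like_same = 1.0 / n_values            # one object, one uniformly drawn value
    like_diff = 1.0 / n_values**2         # two objects, both draw that value
    numerator = prior_same * like_same
    return numerator / (numerator + (1.0 - prior_same) * like_diff)


thousand = prob_same_object(1_000, 1_000_000)
billion = prob_same_object(1_000_000_000, 1_000_000)
```

The posterior simplifies to M / (M + n - 1) for M possible values and n objects, so it is the ratio of population size to measurement-space size that drives the answer.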
Inference
  Given a fixed assignment ι and a fixed “relational skeleton” (complex attributes), the RPM can be grounded as a Bayes net
Inference contd.
  Metropolis-Hastings MCMC
  First part of proposal is a split/merge on ιC:
    Drop two clusters of citations (i.e., papers) Pa, Pb
    Add two empty clusters P1, P2
    Put citations from Pa, Pb uniformly at random into P1, P2
  Second part fills in attribute values:
    Names, title: apply perturbation model in reverse (and occasionally pick a nearby census name)
    Parse: sample directly from precomputed list
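The first part of the proposal can be sketched as follows. This covers only the split/merge move on the clustering; the attribute fill-in and the Metropolis-Hastings acceptance ratio are omitted. Dropping a cluster that ends up empty (which turns the move into a merge) is an assumption of this sketch, and `split_merge_proposal` is a hypothetical name.

```python
import random


def split_merge_proposal(clusters, rng):
    """Pool two randomly chosen clusters and redistribute their citations at random.

    clusters: list of lists of citation ids. Returns a new list of clusters.
    """
    if len(clusters) < 2:
        return [list(c) for c in clusters]
    i, j = rng.sample(range(len(clusters)), 2)   # pick Pa, Pb
    pooled = list(clusters[i]) + list(clusters[j])
    p1, p2 = [], []                               # two new empty clusters
    for citation in pooled:
        (p1 if rng.random() < 0.5 else p2).append(citation)
    untouched = [list(c) for idx, c in enumerate(clusters) if idx not in (i, j)]
    # If every citation landed in one cluster, the move is a merge.
    return untouched + [p for p in (p1, p2) if p]


rng = random.Random(0)
proposal = split_merge_proposal([["C1", "C2"], ["C3"], ["C4"]], rng)
```

Because the redistribution is uniform, the same move can merge the two papers, leave them as they were, or reshuffle citations between them.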
MCMC inference
Scaling
Naïve M-H merge proposals fail more frequently for larger worlds
Canopies (McCallum et al KDD-00; see also blocking methods in data deduplication): sets of entities that have >ε chance of matching according to a simple distance metric (i.e., don’t try merging “Smith” and “Baeza-Yates”).
Pick a canopy first, then a pair of papers in it
Runtime scales
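A rough sketch of canopy construction follows. It uses the two-threshold scheme (a loose and a tight distance threshold) in place of the ε match-chance criterion stated above, with token-set Jaccard distance as the cheap metric. The thresholds, the choice of metric, and the deterministic choice of canopy centers are all assumptions made for illustration.

```python
def jaccard_distance(a: str, b: str) -> float:
    """Cheap distance: 1 - |shared tokens| / |all tokens|."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def build_canopies(items, loose=0.8, tight=0.4):
    """Group item indices into possibly overlapping canopies (requires loose >= tight)."""
    remaining = list(range(len(items)))
    canopies = []
    while remaining:
        center = remaining[0]  # deterministic here; usually chosen at random
        dist = [jaccard_distance(items[center], item) for item in items]
        # Everything within the loose threshold joins this canopy.
        canopies.append([i for i in range(len(items)) if dist[i] <= loose])
        # Everything within the tight threshold can no longer start a canopy.
        remaining = [i for i in remaining if dist[i] > tight]
    return canopies


citations = [
    "russell norvig artificial intelligence modern approach",
    "russell norvig artificial intelligence a modern approach",
    "baeza-yates modern information retrieval",
]
canopies = build_canopies(citations)
```

Merge proposals are then drawn only from pairs that share a canopy, so clearly unrelated citations are never paired.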
Citation Matching Results
Four data sets of ~300-500 citations, referring to ~150-300 papers
[Figure: bar chart of error (fraction of clusters not recovered correctly; axis from 0 to 0.25) on the Reinforce, Face, Reason, and Constraint data sets, comparing Phrase Matching [Lawrence et al. 1999], Generative Model + MCMC [Pasula et al. 2002], and Conditional Random Field [Wellner et al. 2004]]
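The error measure in the figure can be sketched as below, assuming a true cluster counts as recovered only when some predicted cluster contains exactly the same citations. That exact-match reading is an assumption, and the function name `cluster_error` and the toy data are hypothetical.

```python
def cluster_error(true_clusters, predicted_clusters) -> float:
    """Fraction of true clusters with no exactly matching predicted cluster."""
    predicted = {frozenset(c) for c in predicted_clusters}
    missed = sum(1 for c in true_clusters if frozenset(c) not in predicted)
    return missed / len(true_clusters)


# Toy example: only the first of three true clusters is recovered exactly.
truth = [{"C1", "C2"}, {"C3"}, {"C4", "C5"}]
guess = [{"C1", "C2"}, {"C3", "C4"}, {"C5"}]
error = cluster_error(truth, guess)
```

Under this measure a single misplaced citation spoils both the cluster it left and the cluster it joined, which makes it a fairly strict criterion.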
Cross-Citation Disambiguation
Wauchope, K. Eucalyptus: Integrating Natural Language Input with a Graphical User Interface. NRL Report NRL/FR/5510-94-9711 (1994).
Is "Eucalyptus" part of the title, or is the author named K. Eucalyptus Wauchope?
Kenneth Wauchope (1994). Eucalyptus: Integrating natural language input with a graphical user interface. NRL Report NRL/FR/5510-94-9711, Naval Research Laboratory, Washington, DC, 39pp.
Second citation makes it clear how to parse the first one
Cross-citation disambiguation
Later experiments (Milch) showed
  33% of singleton citations are parsed correctly
  64% of non-singleton citations are parsed correctly
What about Russell and Norvig?
  Still generates several clusters:
    Correct version
    Introduction to Artificial Intelligence
    S. Russel and P. Norvig, Artificial Intelligence
    Etc.
  Copying of erroneous citations produces data whose best explanation has multiple clusters
  Could add a copying model, plus times of citing and cited papers
Summary
A somewhat nontrivial application of a declarative, relational, open-universe language
Relatively simple model for existence – did not allow for dependence on other objects
Inference results suggest scalability is not the huge barrier for model-based methods
Later work by Pasula raised accuracy to ~99% (comparable to humans); errors were mainly on partial or concatenated citation strings
Building the model was initially difficult