Linking and Summarizing Information on Entities
Presented by
Min-Yen Kan
Web IR / NLP Group (WING)
Department of Computer Science
National University of Singapore, Singapore
This talk archived as http://wing.comp.nus.edu.sg/~kanmy/talks/080407-nihLMC.htm
NIH Lister Hill Medical Center
Entity Linkage using the Web and Graph-based Update Summarization
Singapore, the garden city
• 4M+ people, sandwiched between Malaysia and Indonesia
• 50 km from the equator: hot and humid year-long
• Known for: urban planning, fondness for acronyms and aversion to bubble gum litterers :-D
WING @ NUS
http://wing.comp.nus.edu.sg
• 1 postdoc, 6 Ph.D. students, 5 undergraduates
• Projects in natural language processing, digital libraries, and information retrieval.
Entity Centric Information Management
“Collate all studies on SBP2 that report new findings in the last year.”
“Oh, I meant the PROTEIN SBP2, not the gene.”
“What other proteins does SBP2 bind to?”
“Tell me more about the contradiction from previous results.”
“Which Miller did the study on SBP2 in 2002?”
Entity Centric Information Management
Two consequences to discuss today:
• Linkage: joint work with Yee Fan TAN, Dongwon LEE (PSU) et al.
• Summarization: joint work with Ziheng LIN et al.
What’s Entity Linkage?
Aggregating data on an object together from heterogeneous resources
Problem: Entity names are ambiguous!
– Medical terms
– Person names
– Products
– Customer records
These problems exist even when we have controlled vocabulary and lexicons (Specialist, UMLS, MeSH)
By UV cross-linking and immunoprecipitation, we show that SBP2 specifically binds selenoprotein mRNAs both in vitro and in vivo.
The SBP2 clone used in this study generates a 3173 nt transcript (2541 nt of coding sequence plus a 632 nt 3’ UTR truncated at the polyadenylation site).
Gene
Protein
Examples of Split Records
Entity Linkage
• Dongwon Lee, 110 E. Foster Ave. #410, State College, PA, 16802 = LEE Dong, 110 East Foster Avenue Apartment 410, University Park, PA 16802-2343
• Honda Fit = Honda Jazz
• Joint Conf. on Digital Libraries = JCDL
• Apple iPod Nano 4GB = 4GB iPod nano 4GB
De-duplication
Ironic, isn’t it?
All over the web!
Jeffrey D. Ullman (Stanford University)
Record linkage, formally defined
• Input
– Two lists of records, A and B
• Output
– For each record a in A and each record b in B, do a and b refer to the same entity?
• Note:
– Entities do not come with unique identifiers
– To disambiguate (deduplicate) items in a single list L, we set A = B = L
Talk Outline
• Linkage using the Web
– Introduction
>> Record linkage using internal knowledge
String matching
Classification or clustering
Graphical formalisms
Blocking
– Record linkage using search engines
• Update Summarization
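Fellegi-Sunter model
[Figure: frequency of similarity (y-axis) plotted against Similarity (a, b) (x-axis) for two overlapping populations, true non-matches (○) and true matches (*). Pairs at the low end are designated as definite non-matches, pairs at the high end are designated as definite matches, and the no-decision region in between is held for human review. The overlap accounts for false matches and false non-matches.]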
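The three regions of the Fellegi-Sunter picture can be sketched as a two-threshold decision rule. This is a minimal illustration, not the model's full probabilistic derivation; the function name and the threshold values 0.3 and 0.8 are assumptions made up for the example.

```python
def classify_pair(similarity: float, lower: float, upper: float) -> str:
    """Three-way decision on a record pair from its similarity score."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if similarity >= upper:
        return "match"
    if similarity <= lower:
        return "non-match"
    return "review"  # no-decision region: hold for human review


# Hypothetical thresholds and scores, for illustration only.
for score in (0.95, 0.50, 0.10):
    print(score, classify_pair(score, lower=0.3, upper=0.8))
```

Moving the two thresholds apart widens the review region and trades human effort for fewer false matches and false non-matches.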
String matching
• String similarity
– Strings as ordered sequences: ([a], [b], [c]) ≠ ([c], [b], [a])
Edit distance
Jaro and Jaro-Winkler
– Strings as unordered sets: {[a], [b], [c]} = {[c], [b], [a]}
Jaccard similarity
Cosine similarity
• Abbreviation matching
– Pattern detection: e.g. “National Institutes of Health (NIH)”
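The contrast between ordered and unordered views of a string can be made concrete with one measure from each family. The sketch below uses Levenshtein edit distance over characters and Jaccard similarity over word sets; the tokenisation (lowercasing, splitting on whitespace) is an assumption for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the order of characters matters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over word sets: word order is ignored."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


print(edit_distance("abc", "cba"))   # reordering costs edits
print(jaccard("a b c", "c b a"))     # reordering is free
```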
Machine Learning
– Create features
String similarity, relationships (e.g. collaborators)
– Then learn a model
Naïve Bayes, Support Vector Machine, K-means, Agglomerative Clustering, …
Same Person?
Yoojin Hong, Byung-Won On and Dongwon Lee. System Support for Name Authority Control Problem in Digital Libraries: OpenDBLP Approach. ECDL 2004.
Sudha Ram, Jinsoo Park and Dongwon Lee. Digital Libraries for the Next Millennium: Challenges and Research Directions. Information Systems Frontiers 1999.
Graphical Methods: Social network analysis
• Nodes: entities
• Edges: relationships
[Figure: example network with nodes Y. Wang, M.-Y. Kan, D. Hsu, J. C. Latombe, A. Dhanik, Y. F. Tan, L. Qiu, T.-S. Chua, T.-H. Chiang, H. Cui]
Analysis
– Connected components
– Distance between nodes
– Node/edge centrality
– Cliques
– Bipartite subgraphs
– …
Talk Outline
• Linkage using the Web
– Introduction
– Record linkage using internal knowledge
>> Record linkage using search engines
Search Engine Features
Adaptive Queries
Query Probing
• Update Summarization
Record linkage using search engines
Previously…
– We assumed input data records contain sufficient information to perform linkage
What if…
– There is insufficient or only noisy information?
– e.g., linking short forms to long forms
Ask other people!
– I.e., consult external (vs. internal) sources of knowledge
– Use the web as a collective knowledge base
Anatomy of Search Engine Results
Number of results
Ranked list
Snippet
URL
Title
Web page
Programmatically accessible through APIs
Derivable Features
• Counts
– Co-occurrence measure between count(q1), count(q2) and count(q1 and q2)
• Hyperlinkage
– Count of web pages of q1 that point to pages of q2, and vice versa
– Incorporate additional indirect links with less weight (e.g., q1 → p → q2)
• Snippets or web pages
– (Cosine) similarity using tokens
– Counts of specific terms
e.g. number of snippets for q1 containing the string q2
– Further natural language processing
Web page features
• Named entities (NE)
– We consider people, organizations, locations
– Each NE token a feature
• NE-targeted (NE-T)
– Motivation: middle names and titles
– For NEs having a token of the target name
• Extract tokens that are not in the target name as features
Example: “Born Edward Charles Morrice Fox in Chelsea, London…”
NE features: Charles, Chelsea, Morrice, Edward, Fox, London, …
NE-T features: Charles, Morrice, …
Using URLs
Where web pages are located is also useful
Hypothesis: If web pages of q1 and web pages of q2 overlap a lot, q1 and q2 are the same entity
Measure this using URL / Host information
• Caveat: Not all hosts are equally telling
– citeseer vs. harvard.edu for author names
– pubmed vs. diabetes-info.com for diabetic terms
• Solution: Weight by Inverse Host Frequency
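The slide does not give a formula for Inverse Host Frequency, so the sketch below assumes one by analogy with inverse document frequency: a host returned for many different queries (an aggregator) gets a weight near zero, and a host returned for few gets a high weight. The function names, the log(N / df) weighting and the weighted-Jaccard overlap are all assumptions for illustration.

```python
import math
from urllib.parse import urlparse


def hosts(urls):
    """Set of hostnames appearing in a list of result URLs."""
    found = set()
    for url in urls:
        host = urlparse(url).hostname
        if host:
            found.add(host)
    return found


def inverse_host_frequency(results_by_query):
    """Assumed weighting: log(N / number of queries whose results include the host)."""
    n = len(results_by_query)
    df = {}
    for urls in results_by_query.values():
        for host in hosts(urls):
            df[host] = df.get(host, 0) + 1
    return {host: math.log(n / count) for host, count in df.items()}


def weighted_host_overlap(urls1, urls2, ihf):
    """IHF-weighted Jaccard overlap between the host sets of two queries."""
    h1, h2 = hosts(urls1), hosts(urls2)
    total = sum(ihf.get(h, 0.0) for h in h1 | h2)
    if total == 0:
        return 0.0
    return sum(ihf.get(h, 0.0) for h in h1 & h2) / total


# Hypothetical search results: every query hits the same aggregator host.
results = {
    "q1": ["http://aggregator.example/a", "http://www.cs.example.edu/~x/"],
    "q2": ["http://aggregator.example/b", "http://www.cs.example.edu/~x/pubs"],
    "q3": ["http://aggregator.example/c", "http://other.example.org/"],
}
ihf = inverse_host_frequency(results)
print(weighted_host_overlap(results["q1"], results["q2"], ihf))
print(weighted_host_overlap(results["q1"], results["q3"], ihf))
```

In this toy data the shared aggregator host contributes nothing, so q1 and q3 score zero even though their host sets overlap.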
URL Features (cont.)
• Page URLs
Hypothesis: URL itself tells quite a lot
• Home page of “lindek”
• CS department, University of Alberta, Canada
– MeURLin (Kan and Nguyen Thi, 2005)
• Tokens (http, www, cs, ualberta, ca, lindek)
• URI parts (scheme:http, hostname:cs, user:lindek, …)
• N-grams (ca ualberta, ualberta cs, cs www, www lindek)
• Length of tokens
• …
http://www.cs.ualberta.ca/~lindek/
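A rough sketch of turning a URL into token and URI-part features follows. It is a simplification written for this example and does not reproduce MeURLin's actual feature set (for instance, it keeps the full hostname rather than a single host token and omits n-grams); the feature-name prefixes are assumptions.

```python
import re
from urllib.parse import urlparse


def url_features(url):
    """Split a URL into simple token, URI-part and token-length features."""
    parsed = urlparse(url)
    tokens = re.findall(r"[A-Za-z0-9]+", url.lower())
    features = [f"token:{t}" for t in tokens]
    features.append(f"scheme:{parsed.scheme}")
    features.append(f"hostname:{parsed.hostname}")
    # A leading "~name" path segment conventionally marks a user home page.
    user = re.match(r"/~([^/]+)", parsed.path)
    if user:
        features.append(f"user:{user.group(1)}")
    features.extend(f"length:{len(t)}" for t in tokens)
    return features


print(url_features("http://www.cs.ualberta.ca/~lindek/"))
```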
Web search engine linkage
Test whether q1 and q2 should be linked
Hypothesis: Web pages of q1 and web pages of q2 share some representative data I
Similar to disconnected triples:
“Jeffrey D. Ullman” = 384K pgs; “Jeffrey D. Ullman” + “aho” = 174K pgs
“J. Ullman” = 124K pgs; “J. Ullman” + “aho” = 41K pgs
“Shimon Ullman” = 27.3K pgs; “Shimon Ullman” + “aho” = 66 pgs
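One way to read the page counts above is as the fraction of each name's pages that also mention the representative data “aho”. The ratio below is an assumed, simple co-occurrence measure chosen for illustration; the talk does not state which measure was used.

```python
# Page counts quoted on the slide: (pages for the name, pages for the name + "aho").
counts = {
    "Jeffrey D. Ullman": (384_000, 174_000),
    "J. Ullman": (124_000, 41_000),
    "Shimon Ullman": (27_300, 66),
}


def cooccurrence_ratio(name_count, joint_count):
    """Share of a name's pages that also contain the representative term."""
    return joint_count / name_count if name_count else 0.0


ratios = {name: cooccurrence_ratio(n, joint) for name, (n, joint) in counts.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.4f}")
```

The two forms that share the collaborator come out at roughly 0.45 and 0.33, while “Shimon Ullman” is below 0.01, which suggests linking the first two names and keeping the third separate.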
Evaluation - Full web pages in WEPS
• Goal
– To compare the usefulness of various features for the Web People Search Task
• Architecture
Input web pages → Feature vectors → Clusters
Clustering step: cosine similarity + single link hierarchical agglomerative clustering + minimum similarity threshold
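The clustering step of this architecture can be sketched compactly. Cutting a single-link dendrogram at a minimum similarity gives the connected components of the graph that links every pair at or above the threshold, so the sketch uses union-find instead of building the dendrogram. The page texts and the threshold of 0.4 are made up for the example.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-token vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def single_link_clusters(vectors, threshold):
    """Single-link agglomerative clustering stopped at a minimum similarity."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical page texts for an ambiguous surname.
pages = [
    "ullman database systems stanford",
    "ullman compilers aho stanford",
    "ullman vision neuroscience weizmann",
]
vectors = [Counter(p.split()) for p in pages]
print(single_link_clusters(vectors, threshold=0.4))
```

Lowering the threshold merges more pages: in this toy data a threshold of 0.2 puts all three pages in one cluster because they share the surname token.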
Evaluation
• F(α = 0.5) and similarity threshold 0.2
[Figure: bar chart, y-axis from 0 to 1, comparing the feature sets Tokens (T), Named Entities (NE), NE-targeted (NE-T), Host (H), Host + Self (H-S), Domain (D), Domain + Self (D-S) and URL (U) on the ECDL, Wikipedia and Census datasets]
Evaluation - Author Disambiguation
• Dataset
– Manually-disambiguated dataset of 24 ambiguous names in computer science domain
– Each ambiguous name represented 2 unique authors (k = 2) except for one where it represented 3
– Each name is attributed to 30 citations on average
– Proportion of largest class ranges from 50% to 97%
• Search engine
– Google (http://www.google.com/)
Evaluation
• Single link performs best
– Good for clustering citations from different publication pages together (some pages list only selected publications)
– Some authors have disparate research areas, not well represented by a centroid vector
• Resolving hostnames to IP addresses gives best accuracy
[Figure: bar chart of classification accuracy averaged over all names, y-axis from 0.66 to 0.86, grouped by clustering method (Single link, Complete link, Group average) with bars for Hostname, Domain and IP address. Bar labels in extraction order: 0.827, 0.726, 0.805, 0.807, 0.798, 0.811, 0.836, 0.734, 0.812]
Discussion
[Figures: per-name accuracies using single link; per-name average number of URLs returned per citation]
Discussion
• Apparent correlation between accuracy and average number of URLs returned per citation
– Author names with few URLs tend to fare poorly since results are mainly aggregator web sites
What’s the cost?
– Lots of queries needed
– Web page downloads are expensive
– Hence, slow
Can we speed this up?
Sure thing…
Query probing
• Consider some publication venues:
– Joint Conference on Digital Libraries
– European Conference on Digital Libraries
– Digital Libraries
• Query probing
– Use common n-gram “digital libraries” as query probe
– If we can obtain information on all three conferences, we save two queries
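Choosing a probe can be sketched as finding the word n-gram shared by the most names. The helper names and the tie-breaking rule (alphabetical among equally common n-grams) are assumptions for this example; issuing the search query itself is left out.

```python
from collections import Counter


def word_ngrams(text, n):
    """Set of lowercase word n-grams in a string."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def best_probe(names, n=2):
    """Return the n-gram shared by the most names, and the names it covers."""
    coverage = Counter()
    for name in names:
        coverage.update(word_ngrams(name, n))
    if not coverage:
        return None, []
    probe, _ = max(coverage.items(), key=lambda kv: (kv[1], kv[0]))
    covered = [name for name in names if probe in word_ngrams(name, n)]
    return probe, covered


venues = [
    "Joint Conference on Digital Libraries",
    "European Conference on Digital Libraries",
    "Digital Libraries",
]
probe, covered = best_probe(venues)
print(probe, len(covered))
```

With the three venues from the slide, one probe query covers all three names, which is the saving of two queries mentioned above.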
Adaptive querying
Combine two methods when needed
• Methods
– Ms: stronger method but very slow (e.g. web page similarity)
– Mw: weaker method but fast (e.g. host overlap)
• Aim
– Accuracy close to Ms
– Significantly lower running time than Ms
• Algorithm
– Execute Mw
– If heuristic suggests that Mw results are likely incorrect
Execute Ms
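The algorithm above amounts to a cheap-first cascade. In the sketch below the two methods are stand-in functions reading precomputed scores, and the heuristic ("a weak score between 0.3 and 0.7 is unreliable") is an assumption invented for the example; the talk does not specify the heuristic.

```python
def adaptive_link(pair, weak_method, strong_method, is_uncertain):
    """Run the fast method first; fall back to the slow one only when needed."""
    weak_score = weak_method(pair)
    if is_uncertain(weak_score):
        return strong_method(pair), "Ms"
    return weak_score, "Mw"


calls = {"Ms": 0}


def host_overlap(pair):       # stand-in for Mw
    return pair["host_overlap"]


def page_similarity(pair):    # stand-in for Ms; counts how often it is invoked
    calls["Ms"] += 1
    return pair["page_similarity"]


def in_grey_zone(score, low=0.3, high=0.7):  # hypothetical heuristic
    return low < score < high


pairs = [
    {"host_overlap": 0.9, "page_similarity": 0.95},
    {"host_overlap": 0.5, "page_similarity": 0.10},
    {"host_overlap": 0.1, "page_similarity": 0.05},
]
results = [adaptive_link(p, host_overlap, page_similarity, in_grey_zone) for p in pairs]
print(results, calls)
```

Only the middle pair triggers the expensive method, so the running time scales with the number of uncertain pairs rather than with all pairs.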
Entity Linkage - Conclusion
• Important problem with a rich history
• New external methods poll contextual evidence for judgment
• Need to combine methods to obtain best aspect of each
Talk Outline
• Linkage using the Web
>> Graph-based Update Summarization
– Introduction
– Timestamped Graphs
– Evaluation and Conclusions
“Now that all this data is linked, how do we process it?”
Applications of Summarization
Decision Support
Doing Less Work
More seriously: an exciting challenge ...
...put a book on the scanner, turn the dial to ‘2 pages’, and read the result...
...download 1000 documents from the web, send them to the summarizer, and select the best ones by reading the summaries of the clusters...
...forward the Japanese email to the summarizer, select ‘1 par’, and skim the translated summary.
…get a weekly digest of new treatments and therapies for pressure ulcers
An update task
Simplifying summarization
Select important sentences verbatim from the input text to form a summary
– Input: A text document with k sentences
– Output: Top n (n << k) sentences with the highest numeric scores (each sentence in the input document is assigned a numeric score)
Extractive Summarization
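The input/output contract above fits in a few lines. The scoring function here (word count) is a deliberately naive placeholder, and returning the chosen sentences in document order is a design choice of this sketch rather than something the slide prescribes.

```python
def extract_summary(sentences, score, n):
    """Pick the n highest-scoring sentences, returned in document order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i], i),
                    reverse=True)
    chosen = sorted(ranked[:n])
    return [sentences[i] for i in chosen]


def word_count_score(sentence, position):
    """Placeholder score: longer sentences rank higher; position is unused."""
    return len(sentence.split())


document = [
    "SBP2 binds selenoprotein mRNAs in vitro and in vivo.",
    "This is new.",
    "The clone generates a 3173 nt transcript.",
    "See below.",
]
print(extract_summary(document, word_count_score, n=2))
```

Any of the heuristics on the following slide (position, cue phrases, TF×IDF, similarity to a query) could replace the placeholder, since `score` receives both the sentence and its position.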
Summarization
Heuristics for extractive summarization
– Cue/stigma phrases
– Sentence position (relative to document, section, paragraph)
– Sentence length
– TF×IDF, TF scores
– Similarity (with title, context, query)
Machine learning to tune weights by supervised learning
Recently, graphical representations of text have shed new light on the summarization problem
Revisiting Social Networks: Prestige
One motivation was to model the problem as finding prestige of nodes in a social network
• PageRank: random walk
In summarization, this led to TextRank and LexRank
• Did we leave anything out of our representation for summarization?
Yes, the notion of an evolving network
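The random-walk notion of prestige can be sketched with plain power iteration. This is a generic PageRank on a small weighted graph, not the TextRank or LexRank implementations; the damping factor 0.85, the iteration count and the three-sentence graph are assumptions for the example, and every neighbour must also appear as a key of the adjacency dict.

```python
def pagerank(adjacency, damping=0.85, iterations=50):
    """Power-iteration PageRank on {node: {neighbour: weight}}."""
    nodes = list(adjacency)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    out_weight = {v: sum(adjacency[v].values()) for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if out_weight[v] == 0)
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for u, w in adjacency[v].items():
                new_rank[u] += damping * rank[v] * w / out_weight[v]
        rank = new_rank
    return rank


# Hypothetical sentence graph: s1 is similar to both s2 and s3.
graph = {
    "s1": {"s2": 1.0, "s3": 1.0},
    "s2": {"s1": 1.0},
    "s3": {"s1": 1.0},
}
scores = pagerank(graph)
print(max(scores, key=scores.get), scores)
```

The most connected sentence collects the most rank, which is the sense in which a graph-based summarizer treats it as the most prestigious.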
Social networks change!
Natural evolving networks (Dorogovtsev and Mendes, 2001)
– Citation networks: New papers can cite old ones, but the old network is static
– The Web: new pages are added with an old page connecting it to the web graph, old pages may update links
Talk Outline
• Linkage using the Web
• Graph-based Update Summarization
– Introduction
>> Timestamped Graphs
– Evaluation and Conclusion
Evolutionary models for summarization
Writers and readers often follow conventional rhetorical styles - articles are not written or read in an arbitrary way
Consider the evolution of texts using a very simplistic model
– Writers write from the first sentence onwards in a text
– Readers read from the first sentence onwards of a text
A simple model: sentences get added incrementally to the graph
Timestamped Graph Construction
These assumptions suggest iteratively adding sentences to the graph in chronological order.
At each iteration, consider which edges to add to the graph.
– For single document: simple and straightforward: add 1st sentence, followed by the 2nd, and so forth, until the last sentence is added
– For multi-document: treat it as multiple instances of single documents, which evolve in parallel; i.e., add 1st sentences of all documents, followed by all 2nd sentences, and so forth
• NB: Doesn’t really model chronological ordering between articles, fix later
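The construction described above can be sketched as follows. This is one reading of the slides, not the authors' implementation: at timestep t the t-th sentence of every document is added, and then every node in the graph draws one edge to the most similar node it is not yet linked to. Jaccard similarity over words, the node naming and the tie-breaking are assumptions for the example.

```python
def jaccard(a, b):
    """Word-set Jaccard similarity between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def build_tsg(documents, edges_per_step=1):
    """Grow a graph by adding the t-th sentence of every document at timestep t.

    After each addition step, every node in the graph draws up to
    `edges_per_step` directed edges to the most similar nodes it is not
    yet linked to. Nodes are (doc_index, sentence_index) pairs.
    """
    nodes, edges, text = [], set(), {}
    for t in range(max((len(d) for d in documents), default=0)):
        for d, doc in enumerate(documents):
            if t < len(doc):
                nodes.append((d, t))
                text[(d, t)] = doc[t]
        for u in nodes:
            candidates = [v for v in nodes if v != u and (u, v) not in edges]
            candidates.sort(key=lambda v: jaccard(text[u], text[v]), reverse=True)
            for v in candidates[:edges_per_step]:
                edges.add((u, v))
    return nodes, edges


# Two hypothetical two-sentence documents.
docs = [["a b", "a c"], ["a b d", "e f"]]
nodes, edges = build_tsg(docs)
print(nodes)
print(sorted(edges))
```

Earlier sentences take part in more timesteps and so accumulate more outgoing edges, which is how the order of the text ends up encoded in the graph.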
Timestamped Graph Construction
Model:
• Documents as columns
– di = document i
• Sentences as rows
– sj = jth sentence of document
Timestamped Graph Construction
• A multi-document example
[Figure: grid with documents doc1, doc2, doc3 as columns and sentences sent1, sent2, sent3 as rows]
An example TSG: DUC 2007 D0703A-A
Timestamped Graph Construction
This is just one instance of a TSG
Let’s generalize and formalize them
Def: A timestamped graph algorithm tsg(M) is a 9-tuple (d, e, u, f, σ, t, i, s, τ) that specifies a resulting algorithm that takes as input the set of texts M and outputs a graph G
– (d, e, u, f): properties of edges
– (σ, t, i, s): properties of nodes
– τ: input text transformation function
Edge properties (d, e, u, f)
• Edge Direction (d)
– Forward, backward, or undirected
• Edge Number (e)
– number of edges to instantiate per timestep
• Edge Weight (u)
– weighted or unweighted edges
• Inter-document factor (f)
– penalty factor for links between documents in multi-document sets.
Node properties (σ, t, i, s)
• Vertex selection function σ(u, G)
– One strategy: among those nodes not yet connected to u in G, choose the one with highest similarity according to u
– Similarity functions: Jaccard, cosine, concept links (Ye et al., 2005)
• Text unit type (t)
– Most extractive algorithms use sentences as elementary units
• Node increment factor (i)
– How many nodes get added at each timestep
• Skew degree (s)
– Models how nodes in multi-document graphs are added
– Skew degree = how many iterations to wait before adding the 1st sentence of the next document
– Skip for today…
Timestamped Graph Construction
• Representations
– We can model a number of different algorithms using this 9-tuple formalism:
(d, e, u, f, σ, t, i, s, τ)
– The given toy example:
(f, 1, 0, 1, max-cosine-based, sentence, 1, 0, null)
– LexRank graphs:
(u, N, 1, 1, cosine-based, sentence, Lmax, 0, null)
N = total number of sentences in the cluster; Lmax = the max document length
i.e., all sentences are added into the graph in one timestep, each connected to all others, and cosine scores are given to edge weights
System Overview
• Sentence splitting
– Detect and mark sentence boundaries
– Annotate each sentence with the doc ID and the sentence number
– E.g., XIE19980304.0061: 4 March 1998 from Xinhua News; XIE19980304.0061-14: the 14th sentence of this document
• Graph construction
– Construct TSG in this phase
System Overview
• Sentence Ranking
– Apply topic-sensitive random walk on the graph to redistribute the weights of the nodes
• Sentence extraction
– Extract the top-ranked sentences
– Two different modified MMR re-rankers are used, depending on whether it is main or update task
Talk Outline
• Linkage using the Web
• Graph-based Update Summarization
– Introduction
– Timestamped Graphs
>> Evaluation and Conclusion
Evaluation
• Dataset: DUC 2005, 2006 and 2007
• Evaluation tool: ROUGE, an n-gram based automatic evaluation
• Each dataset contains 50 or 45 clusters; each cluster contains a query and 25 documents
• Evaluate on some parameters
– Do different e values affect the summarization process?
• e = 2 works best for DUC dataset
– How do topic-sensitivity and edge weighting perform in running PageRank?
• Applying both seems to have best effect
– How does skewing the graph affect the information flow in the graph?
• Skew of 1 works best, but need to try other possibilities
Holistic Evaluation in DUC 2007
Extractive-based TSG system
Used modified maximal marginal relevance for update tasks
– Penalize links in previously read articles
– Extension of inter-document factor (f)
[Figure: Cluster 1, Cluster 2, Cluster 3]
Evaluation Results
Main task: 10th of 32 systems
Update task: 3rd of 24 systems
Conclusion
• TSG formalism better tailored to deal with update / incremental text tasks
• New method that may be competitive with current approaches
– Other top scoring systems may do sentence compression (abstractive), not just extraction
Graph-based Update Summary - Conclusion
Proposed a timestamped graph model for text understanding and summarization
– Adds sentences in an incremental fashion
Future work:
– Freely skewed model
– Empirical and theoretical properties of TSGs
Where do we go from here?
Thank you!
http://wing.comp.nus.edu.sg/
Organizing data around entities, events
• How people deal with data anyways
• Understand objects and their inter/intra-relationship
• Automation requires domain-expertise within a generic framework
“Collate all studies on SBP2 that report new findings in the last year.”
“Oh, I meant the PROTEIN SBP2, not the gene.”
“What other proteins does SBP2 bind to?”
“Tell me more about the contradiction from previous results.”
“Which Miller did the study on SBP2 in 2002?”
Backup Slides – Entity Linkage
50 Minute talk total
7 Apr 2008, 10 – 11 AM
Social network analysis
• Connected triple
• Random walk
• Maximum flow
• Clustering
[Figures: small example graphs over nodes x1, x2, x3, and a flow network between nodes s and t]
Scalability Issues
• Pairwise comparisons
– Requires O(n²) time
– Major bottleneck
• Possible solutions
– Blocking techniques
– Avoiding pairwise comparisons altogether
Input: d1, d2, …, dn
for i = 1 to n
    for j = (i + 1) to n
        compute sim(di, dj)
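Blocking replaces the all-pairs loop with comparisons inside groups that share a cheap key. The sketch below is a generic illustration; the blocking key (the lowercased last word of a name) and the sample records are assumptions for the example, and a real key would need to tolerate more variation.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, block_key):
    """Only pair up records that share a blocking key, instead of all O(n^2) pairs."""
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        blocks[block_key(record)].append(index)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))
    return pairs


def last_word(name):  # hypothetical blocking key
    return name.lower().split()[-1]


names = ["Dongwon Lee", "D. Lee", "Min-Yen Kan", "M.-Y. Kan", "Yee Fan Tan"]
pairs = candidate_pairs(names, last_word)
print(sorted(pairs), "instead of", len(names) * (len(names) - 1) // 2, "pairs")
```

The similarity function then runs only on the surviving pairs; the price is that true matches falling into different blocks are never compared.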
Cost-utility Framework
[Figure: matrix of records r1 to r6 (rows) by features f1 to f5 (columns), distinguishing known values from values that can be acquired; each feature fi has a cost of acquiring it (c1 to c5) and a utility of acquiring it (u1 to u5)]
Record Matching
Header-reference pair (instance)
[1] Given information: TITLE_MIN_LEN, TITLE_MAX_LEN, AUTHOR_MIN_LEN, AUTHOR_MAX_LEN, VENUE_MIN_LEN, VENUE_MAX_LEN
[2] Information that can be acquired at a cost: TITLE_SIM, AUTHOR_SIM, VENUE_SIM
MATCH/MISMATCH?
Training data: assume all feature-values and their acquisition costs known
Testing data: assume [1] known, but feature-values and their acquisition costs in [2] unknown
Costs: set to MIN_LEN * MAX_LEN
Costs and Utilities
• Costs
– Trained 3 models (using M5’), treat as regression
• Utilities
– Trained 2^3 = 8 classifiers (each to predict match/mismatch using only known feature-values)
– For a test instance with a missing feature-value F
Get confidence of appropriate classifier without F
Get expected confidence of appropriate classifier with F
Utility is difference between the two confidence scores
• Note
– Similar to Saar-Tsechansky et al.
Results
[Figures: two charts, y-axis from 0 to 1, plotting Normalized cost, Recall, Precision and F-measure against an increasing proportion of feature-values acquired; one panel without cleaning of header records, the other with manual cleaning of header records]
Selected Bibliography
• General and surveys
– Ivan P. Fellegi and Alan B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, December 1969.
– William E. Winkler and Yves Thibaudeau. An application of the Fellegi-Sunter Model of record linkage to the 1990 U.S. Decennial Census. Technical Report RR91/09, U.S. Bureau of the Census, 1991.
– Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering (TKDE), 19(1):1–16, January 2007.
– William E. Winkler. Overview of record linkage and current research directions. Technical Report RRS2006/02, U.S. Bureau of the Census, February 2006.
– Mikhail Bilenko, Raymond J. Mooney, William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. Adaptive name matching in information integration. IEEE Intelligent Systems, 18(5):16–23, January/February 2003.
– Min-Yen Kan and Yee Fan Tan. Record Matching in Digital Library Metadata. To appear in Communications of the ACM (CACM).
Selected Bibliography
• String matching
– Robert A. Wagner and Michael J. Fischer. The string-to-string correction problem. Journal of the Association for Computing Machinery, 21(1):168–173, January 1974.
– Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, March 1970.
– Temple F. Smith and Michael S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, March 1981.
– Andrés Marzal and Enrique Vidal. Computation of normalized edit distance and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9):926–932, September 1993.
– Alvaro E. Monge and Charles Elkan. The field matching problem: Algorithms and applications. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 267–270, August 1996.
– Jie Wei. Markov edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(3):311–321, March 2004.
– Mikhail Bilenko and Raymond J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 39–48, August 2003.
– Andrew McCallum, Kedar Bellare, and Fernando Pereira. A Conditional Random Field For Discriminatively-Trained Finite-State String Edit Distance. In Conference on Uncertainty in Artificial Intelligence (UAI), July 2005.
– William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. A comparison of string distance metrics for name-matching tasks. In Information Integration on the Web (IIWeb), pages 73–78, August 2003.
– Ariel S. Schwartz and Marti A. Hearst. A simple algorithm for identifying abbreviation definitions in biomedical text. In Pacific Symposium on Biocomputing (PSB), pages 451–462, January 2003.
– Youngja Park and Roy J. Byrd. Hybrid text mining for finding abbreviations and their definitions. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 126–133, June 2001.
– Jeffrey T. Chang, Hinrich Schütze, and Russ B. Altman. Creating an online dictionary of abbreviations from MEDLINE. Journal of the American Medical Informatics Association, 9(6):612–620, November/December 2002.
– Hiroko Ao and Toshihisa Takagi. ALICE: An algorithm to extract abbreviations from MEDLINE. Journal of the American Medical Informatics Association, 12(5):576–586, September/October 2005.
Selected Bibliography
• Direct classification or clustering, and blocking
– Hui Han, Hongyuan Zha, and C. Lee Giles. A model-based K-means algorithm for name disambiguation. In Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data, October 2003.
– Hui Han, C. Lee Giles, Hongyuan Zha, Cheng Li, and Kostas Tsioutsiouliklis. Two supervised learning approaches for name disambiguation in author citations. In ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 296–305, June 2004.
– Hui Han, Wei Xu, Hongyuan Zha, and C. Lee Giles. A hierarchical naive bayes mixture model for name disambiguation in author citations. In ACM Symposium on Applied Computing (SAC), pages 1065–1069, March 2005.
– Hui Han, Hongyuan Zha, and C. Lee Giles. Name disambiguation in author citations using a K-way spectral clustering method. In ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 334–343, June 2005.
– Dongwon Lee, Byung-Won On, Jaewoo Kang, and Sanghyun Park. Effective and scalable solutions for mixed and split citation problems in digital libraries. In ACM SIGMOD Workshop on Information Quality in Information Systems (IQIS), pages 69–76, June 2005.
– Byung-Won On, Dongwon Lee, Jaewoo Kang, and Prasenjit Mitra. Comparative study of name disambiguation problem using a scalable blocking-based framework. In ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 344–353, June 2005.
– Andrew McCallum, Kamal Nigam, and Lyle Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 169–178, August 2000.
– Matthew Michelson and Craig A. Knoblock. Learning blocking schemes for record linkage. In National Conference on Artificial Intelligence (AAAI), July 2006.
– Mikhail Bilenko, Beena Kamath, and Raymond J. Mooney. Adaptive Blocking: Learning to Scale Up Record Linkage and Clustering. In IEEE International Conference on Data Mining (ICDM), December 2006.
Selected Bibliography
• Graphical models
– Jie Wei. Markov edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(3):311–321, March 2004.
– John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML), pages 282–289, June/July 2001.
– Andrew McCallum and Ben Wellner. Object consolidation by graph partitioning with a conditionally-trained distance metric. In ACM SIGKDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation, pages 19–24, August 2003.
– Ben Wellner, Andrew McCallum, Fuchun Peng, and Michael Hay. An integrated, conditional model of information extraction and coreference with application to citation matching. In Conference on Uncertainty in Artificial Intelligence (UAI), pages 593–601, July 2004.
– Andrew McCallum, Kedar Bellare, and Fernando Pereira. A Conditional Random Field For Discriminatively-Trained Finite-State String Edit Distance. In Conference on Uncertainty in Artificial Intelligence (UAI), July 2005.
– Xin Dong, Alon Halevy, and Jayant Madhavan. Reference reconciliation in complex information spaces. In ACM SIGMOD International Conference on Management of Data, pages 85–96, June 2005.
– Indrajit Bhattacharya and Lise Getoor. A latent dirichlet model for unsupervised entity resolution. In SIAM International Conference on Data Mining, pages 47–58, April 2006.
Selected Bibliography
• Social network analysis
– H. A. Kautz, B. Selman, and M. A. Shah. The hidden web. AI Magazine, 18(2):27–36, 1997.
– P. Mutschke. Mining networks and central entities in digital libraries. A graph theoretic approach applied to co-author networks. In Intelligent Data Analysis (IDA), pages 155–166, August 2003.
– M. E. J. Newman. Who is the best connected scientist? A study of scientific coauthorship networks. In Complex Networks, pages 337–370, February 2004.
– E. Otte and R. Rousseau. Social network analysis: a powerful strategy, also for the information sciences. Journal of Information Science, 28(6), December 2002.
– T. Krichel and N. Bakkalbasi. A social network analysis of research collaboration in the economics community. In International Workshop on Webometrics, Informetrics and Scientometrics & Seventh COLLNET Meeting, May 2006.
– R. Rousseau and M. Thelwall. Escher staircases on the world wide web. First Monday, 9(6), June 2004.
– D. G. Feitelson. On identifying name equivalences in digital libraries. Information Research, 9(4), October 2004.
– R. Bekkerman and A. McCallum. Disambiguating web appearances of people in a social network. In International conference on World Wide Web (WWW), pages 463–470, May 2005.
– R. Holzer, B. Malin, and L. Sweeney. Email alias detection using social network analysis. In Workshop on Link Discovery: Issues, Approaches and Applications (LinkKDD), August 2005.
– B. Malin, E. Airoldi, and K. M. Carley. A network analysis model for disambiguation of names in lists. Computational and Mathematical Organization Theory, 11(2):119–139, July 2005.
– G. Flake, S. Lawrence, and C. L. Giles. Efficient identification of web communities. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 150–160, August 2000.
– P. K. Reddy and M. Kitsuregawa. An approach to build a cyber-community hierarchy. In SIAM ICDM Workshop on Web Analysis, April 2002.
– Patrick Reuther. Personal name matching: New test collections and a social network based approach. Technical Report Mathematics/Computer Science 06-01, University of Trier, March 2006.
– Yutaka Matsuo, Junichiro Mori, Masahiro Hamasaki, Keisuke Ishida, Takuichi Nishimura, Hideaki Takeda, Kôiti Hasida, and Mitsuru Ishizuka. POLYPHONET: an advanced social network extraction system from the web. In International conference on World Wide Web (WWW), pages 397–406, May 2006.
Selected Bibliography
• Web-based methods
– Jamie P. Callan, Margie E. Connell, and Aiqun Du. Automatic discovery of language models for text databases. In ACM SIGMOD International Conference on Management of Data, pages 479–490, June 1999.
– Jamie P. Callan and Margie E. Connell. Query-based sampling of text databases. ACM Transactions on Information Systems (TOIS), 19(2):97–130, April 2001.
– Panagiotis G. Ipeirotis and Luis Gravano. Distributed search over the hidden-web: Hierarchical database sampling and selection. In International Conference on Very Large Databases (VLDB), pages 394–405, August 2002.
– Luis Gravano, Panagiotis G. Ipeirotis, and Mehran Sahami. QProber: A system for automatic classification of hidden-web databases. ACM Transactions on Information Systems (TOIS), 21(1):1–41, January 2003.
– Aron Culotta, Ron Bekkerman, and Andrew McCallum. Extracting social networks and contact information from email and the web. In Conference on Email and Anti-Spam (CEAS), July 2004.
– Philipp Cimiano, Siegfried Handschuh, and Steffen Staab. Towards the self-annotating web. In International conference on World Wide Web (WWW), pages 462–471, May 2004.
– Philipp Cimiano, Günter Ladwig, and Steffen Staab. Gimme the context: Context-driven automatic semantic annotation with C-PANKOW. In International conference on World Wide Web (WWW), pages 332–341, May 2005.
– Yutaka Matsuo, Junichiro Mori, Masahiro Hamasaki, Keisuke Ishida, Takuichi Nishimura, Hideaki Takeda, Kôiti Hasida, and Mitsuru Ishizuka. POLYPHONET: an advanced social network extraction system from the web. In International conference on World Wide Web (WWW), pages 397–406, May 2006.
– Yee Fan Tan, Min-Yen Kan, and Dongwon Lee. Search engine driven author disambiguation. In ACM/IEEE Joint Conference on Digital Libraries (JCDL), June 2006.
– Ergin Elmacioglu, Min-Yen Kan, Dongwon Lee, and Yi Zhang. Googled name linkage. 2007.
– Yee Fan Tan, Ergin Elmacioglu, Min-Yen Kan, and Dongwon Lee. Record Linkage of Short Forms to Long Forms: A Case Study of Publication Venues. 2007.
– Min-Yen Kan. Web page classification without the web page. In International conference on World Wide Web (WWW), pages 262–263, May 2004.
– Min-Yen Kan and Hoang Oanh Nguyen Thi. Fast webpage classification using url features. In International Conference on Information and Knowledge Management (CIKM), pages 325–326, October/November 2005.
– Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay Jain, and Luis Gravano. To search or to crawl? Towards a query optimizer for text-centric tasks. In ACM SIGMOD International Conference on Management of Data, pages 265–276, June 2006.
Backup Slides - Summarization
50 Minute talk total
7 Apr 2008, 10 – 11 AM
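A Summarization Machine
[Figure: a summarization machine. Inputs: DOC or MULTIDOCS, plus an optional QUERY. Output: Summary. Settings shown: Extract vs. Abstract; Indicative vs. Informative; Generic vs. Query-oriented; Background vs. Just the news; 10%, 50%, 100%; Headline, Very Brief, Brief, Long.]
Generate a summary given a text document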
Summarization, defined
• Definitions
– Take a text document, extract content from it and present the most important content to the user in a condensed form and in a manner sensitive to the user’s or application’s needs
• Summarization requires:
– understanding the meaning of a text document
– generating a fluent text summary
• Studies of human summarizers
– Cremmins (65) & Endres-Niggemeyer (98) showed that professional summarizers used clues to pick summary content.
Skew Degree Examples
time(d1) < time(d2) < time(d3) < time(d4)
[Figure: three timestamped graphs over documents d1 to d4, labeled "Skewed by 1", "Skewed by 2", and "Freely skewed".]
Freely skewed = only add a new document when it would be linked by some node using vertex function σ
Input text transformation function (τ)
• Document segmentation function (τ)
– Problem observed in some clusters: some documents in a multi-document cluster are very long
– Such documents take many timestamps to introduce all of their sentences, causing too many edges to be drawn
– τ(G) segments long documents into several sub-documents
• The current solution is ad hoc; we hope to investigate it further in current and future work
[Figure: document d5 segmented into sub-documents d5a and d5b.]
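The segmentation idea can be illustrated with a minimal Python sketch. It assumes a document is already a list of sentences and that "long" means exceeding a fixed sentence count; the function name `segment_document` and the threshold `max_len` are hypothetical and not taken from the talk, which does not state how τ chooses its split points.

```python
def segment_document(sentences, max_len=20):
    """Split a list of sentences into consecutive sub-documents.

    Each sub-document holds at most max_len sentences (a hypothetical
    threshold), so a long document needs fewer timestamps to introduce
    all of its sentences.
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    return [sentences[i:i + max_len] for i in range(0, len(sentences), max_len)]


# Example: a 7-sentence document d5 split into d5a and d5b.
d5 = [f"sentence {i}" for i in range(1, 8)]
d5a, d5b = segment_document(d5, max_len=4)
```

Fixed-size chunks are only one possible choice; splitting at paragraph or topic boundaries would fit the same interface.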
Evaluation on number of edges (e)
Tried different e values
• Optimal performance: e = 2
• At e = 1, the graph is too loosely connected and not suitable for PageRank → very low performance
• At e = N, the system is a LexRank system
[Figure: plots over e, with e = 2 and N marked.]
Evaluation (other edge parameters)
• PageRank: generic vs. topic-sensitive
• Edge weight (u): unweighted vs. weighted
• Optimal performance: topic-sensitive PageRank and weighted edges
Topic-sensitive   Weighted edges   ROUGE-1   ROUGE-2
No                No               0.39358   0.07690
Yes               No               0.39443   0.07838
No                Yes              0.39823   0.08072
Yes               Yes              0.39845   0.08282
Evaluation on skew degree (s)
• Different skew degrees: s = 0, 1 and 2
• Optimal performance: s = 1
• s = 2 introduces a delay interval that is too large
• Need to try freely skewed graphs
Skew degree   ROUGE-1   ROUGE-2
0             0.36982   0.07580
1             0.37268   0.07682
2             0.36998   0.07489
Describing Summaries
Aspects of summarization (Sparck-Jones 97, Hovy and Lin 99)
• Input
– Single-document vs. multi-document
• Purpose
– Situation: embedded in larger system (MT, IR) or not?
– Generic vs. query-oriented: author’s view or user’s interest?
– Indicative vs. informative: categorization or understanding?
– Background vs. just-the-news: does user have prior knowledge?
• Output
– Extract vs. abstract: use text fragments or re-phrase content?
Differences for main and update task processing
Main task:
1. Construct a TSG for input cluster
2. Run topic-sensitive PageRank on the TSG
3. Apply the first modified version of MMR to extract sentences
Update task:
• Cluster A:
– Construct a TSG for cluster A
– Run topic-sensitive PageRank on the TSG
– Apply the second modified version of MMR to extract sentences
• Cluster B:
– Construct a TSG for clusters A and B
– Run topic-sensitive PageRank on the TSG; only retain sentences from B
– Apply the second modified version of MMR to extract sentences
• Cluster C:
– Construct a TSG for clusters A, B and C
– Run topic-sensitive PageRank on the TSG; only retain sentences from C
– Apply the second modified version of MMR to extract sentences
Sentence Ranking
• Once a timestamped graph is built, we want to compute a prestige score for each node
• PageRank: use an iterative method that allows the weights of the nodes to redistribute until stability is reached
• Similarities as edges → weighted edges; query → topic-sensitive
[Equation: PageRank update, annotated with a topic-sensitive (Q) portion and a standard random walk term.]
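As an illustration of the ranking step, here is a minimal Python sketch of topic-sensitive PageRank over weighted edges. It uses the common textbook formulation, in which the jump distribution is proportional to each node's query relevance; the slide's own equation did not survive extraction, so the exact form, the damping value 0.85, and all names below (`topic_sensitive_pagerank`, `weights`, `topic_scores`) are assumptions rather than the talk's definition.

```python
def topic_sensitive_pagerank(weights, topic_scores, damping=0.85,
                             tol=1e-8, max_iter=200):
    """Iterate PageRank until the scores stabilize.

    weights[u][v] is the weight of edge u -> v (e.g. sentence similarity);
    every edge target must also be a key of topic_scores.
    topic_scores[v] is the non-negative relevance of node v to the query.
    """
    nodes = list(topic_scores)
    total_topic = sum(topic_scores.values())
    # Topic-sensitive portion: jump to nodes in proportion to query relevance.
    # Fall back to a uniform jump if no node is relevant (an assumption).
    if total_topic > 0:
        jump = {v: topic_scores[v] / total_topic for v in nodes}
    else:
        jump = {v: 1.0 / len(nodes) for v in nodes}
    out_weight = {u: sum(weights.get(u, {}).values()) for u in nodes}
    rank = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(max_iter):
        # Nodes without outgoing edges hand their mass to the jump distribution.
        dangling = sum(rank[u] for u in nodes if out_weight[u] == 0)
        new_rank = {v: (1 - damping + damping * dangling) * jump[v] for v in nodes}
        # Standard random walk term: follow edges in proportion to their weight.
        for u in nodes:
            if out_weight[u] == 0:
                continue
            for v, w in weights[u].items():
                new_rank[v] += damping * rank[u] * w / out_weight[u]
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Example: three mutually similar sentences; only "a" matches the query.
example_weights = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0},
}
ranks = topic_sensitive_pagerank(example_weights, {"a": 1.0, "b": 0.0, "c": 0.0})
```

In this example the query-relevant node "a" ends up with the highest score, while "b" and "c" tie by symmetry.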
Sentence Extraction – Main task
• Original MMR: integrates a penalty based on the maximal similarity between the candidate document and one selected document
• Ye et al. (2005) introduced a modified MMR: integrates a penalty based on the total similarity between the candidate sentence and all selected sentences
• Score(s) = PageRank score of s; S = selected sentences
• This is used in the main task
[Equation: modified MMR, with the penalty term annotated "all previous sentence similarity".]
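The greedy selection described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the formula from Ye et al. (2005): the penalty is taken to be a single weight (`penalty`, hypothetical) times the summed similarity to every sentence already selected, and `select_sentences`, `scores`, and `similarity` are illustrative names.

```python
def select_sentences(scores, similarity, k, penalty=0.5):
    """Greedily pick k sentences by PageRank score minus a redundancy penalty.

    scores[s] is the PageRank score of sentence s.
    similarity(s, t) returns the similarity of two sentences.
    The penalty sums similarity over ALL selected sentences, rather than
    taking the maximum as in the original MMR.
    """
    selected = []
    candidates = set(scores)
    while candidates and len(selected) < k:
        def mmr(s):
            return scores[s] - penalty * sum(similarity(s, t) for t in selected)
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(candidates), key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected


# Example: s2 scores well but is nearly a duplicate of s1.
example_scores = {"s1": 0.9, "s2": 0.85, "s3": 0.5}
pair_sim = {
    frozenset(("s1", "s2")): 0.9,
    frozenset(("s1", "s3")): 0.1,
    frozenset(("s2", "s3")): 0.1,
}


def example_similarity(a, b):
    return pair_sim[frozenset((a, b))]


summary = select_sentences(example_scores, example_similarity, k=2)
```

With the penalty in place the near-duplicate s2 loses to s3 for the second slot; setting `penalty=0` falls back to ranking by PageRank score alone.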
Sentence Extraction – Update task
• Update task assumes readers have already read the previous cluster(s)
– implies we should not select sentences that are redundant with the previous cluster(s)
• Proposed modified MMR for the update task:
– considers the total similarity of the candidate sentence with all selected sentences and with sentences in previously-read cluster(s)
• P contains some top-ranked sentences in previous cluster(s)
[Equation: modified MMR for the update task, with an added term annotated "previous cluster overlap".]
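A minimal Python sketch of the update-task selection follows. It assumes the penalty simply adds the similarity to sentences in P to the similarity to already-selected sentences, with one shared weight; the talk's actual equation may weight the two terms differently. The names `select_update_sentences`, `previous`, and `penalty` are hypothetical.

```python
def select_update_sentences(scores, similarity, previous, k, penalty=0.5):
    """Greedily pick k sentences that are salient and new to the reader.

    scores[s] is the PageRank score of a candidate from the current cluster.
    previous is P: top-ranked sentences from clusters the reader has seen.
    similarity(s, t) returns the similarity of two sentences.
    """
    selected = []
    candidates = set(scores)
    while candidates and len(selected) < k:
        # Redundancy is measured against both the summary so far and P.
        seen = selected + list(previous)

        def mmr(s):
            return scores[s] - penalty * sum(similarity(s, t) for t in seen)

        best = max(sorted(candidates), key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected


# Example: b1 is salient in cluster B but repeats a1 from cluster A.
update_scores = {"b1": 0.9, "b2": 0.6}
update_sim = {
    frozenset(("b1", "a1")): 0.8,
    frozenset(("b2", "a1")): 0.0,
    frozenset(("b1", "b2")): 0.2,
}


def update_similarity(a, b):
    return update_sim[frozenset((a, b))]


update_summary = select_update_sentences(
    update_scores, update_similarity, previous=["a1"], k=1)
```

Here b2 is chosen over the higher-scoring b1 because b1 overlaps with what the reader has already seen; with an empty `previous` list the function reduces to the main-task behaviour and picks b1.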
References
• Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based centrality as salience in text summarization. Journal of Artificial Intelligence Research, (22).
• Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into texts. In Proceedings of EMNLP 2004.
• S.N. Dorogovtsev and J.F.F. Mendes. 2001. Evolution of networks. Submitted to Advances in Physics on 6th March 2001.
• Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7).
• Jon M. Kleinberg. 1999. Authoritative sources in a hyperlinked environment. In Proceedings of ACM-SIAM Symposium on Discrete Algorithms, 1999.
• Shiren Ye, Long Qiu, Tat-Seng Chua, and Min-Yen Kan. 2005. NUS at DUC 2005: Understanding documents via concepts links. In Proceedings of DUC 2005.