Wikitology: Wikipedia as an Ontology
Zareen Syed and Anupam Joshi
University of Maryland, Baltimore County
James Mayfield, Paul McNamee and Christine Piatko
JHU Human Language Technology Center of Excellence
Tim Finin, UMBC
Overview
• Introduction
• Wikipedia as an ontology
• Applications
• Discussion
• Conclusion
Wikis and Knowledge
• Wikis are a great way to collaborate on knowledge encoding
  – Wikipedia is an archetype for this, but there are many examples
• Ongoing research is exploring how to integrate this with structured knowledge
  – DBpedia, Semantic MediaWiki, Freebase, etc.
• I’ll describe an approach we’ve taken and experiments in using it
  – We came at this from an IR/HLT perspective
Wikipedia data in RDF
Populating Freebase KB
Populating Powerset’s KB
AskWiki uses Wikipedia for QA
With sometimes surprising results
TrueKnowledge mines Wikipedia
Wikipedia pages as tags
Wikitology
We are exploring an approach to deriving an ontology from Wikipedia that is useful in a variety of language processing tasks
Our original problem (2006)
• Problem: describe what an analyst has been working on to support collaboration
• Idea: track documents she reads and map these to terms in an ontology, aggregate to produce a short list of topics
• Approach: use Wikipedia articles as ontology terms, use document-article similarity for the mapping, and spreading activation for aggregation
What’s a document about?
Two common approaches:
(1) Select words and phrases using TF-IDF that characterize the document
(2) Map document to a list of terms from a controlled vocabulary or ontology
(1) is flexible and does not require creating and maintaining an ontology
(2) can tie documents to a rich knowledge base
Wikitology!
• Using Wikipedia as an ontology offers the best of both approaches
  – each article (~3M) is a concept in the ontology
  – terms linked via Wikipedia’s category system (~200K) and inter-article links
  – lots of structured and semi-structured data
• It’s a consensus ontology created and maintained by a diverse community
• Broad coverage, multilingual, very current
• Overall content quality is high
Wikitology features
• Terms have unique IDs (URLs) and are “self describing” for people
• Underlying graphs provide structure and associations: categories, article links, disambiguation, aliases (redirects), …
• Article history contains useful meta-data for trust, provenance, controversy, …
• External sources provide more info (e.g., Google’s PageRank)
• Annotated with structured data from DBpedia, Freebase, Geonames & LOD
Problems as an Ontology
Treating Wikipedia as an ontology reveals many problems
• Uncategorized and miscategorized articles
• Single document in too many categories:
– George W. Bush is included in about 30 categories
•Links between articles belonging to very different categories
– John F. Kennedy has a link to “coincidence theory”, which belongs under Mathematical Analysis / Topology / Fixed Points
Problems as an Ontology
• Article links in text are not “typed”
• Uneven category articulation
  – Some categories are underrepresented whereas others have many articles
• Administrative categories, e.g.
  – Clean up from Sep 2006
  – Articles with unsourced statements
• Over-linking, e.g.
  – A mention of United States linked to the page United_States
  – Mentions of 1949 linked to the year 1949
Problems as an Ontology
Wikipedia’s infobox templates have great potential but have several problems
• Multiple templates for the same class
• Multiple attribute names for the same property
  – E.g., six attributes for a person’s birth date
• Attributes lack domains or datatypes
  – E.g., a value can be a string or a link
Wikitology 1, 2, 3
• We’ve addressed some of these problems in developing Wikitology
• The development has been driven by several use cases and applications
Wikitology Use Cases
• Identifying user context in a collaboration system from documents viewed (2006)
• Improve IR accuracy by adding Wikitology tags to documents (2007)
• Cross document co-reference resolution for named entities in text (2008)
• Knowledge Base population from text (2009)
• Improve Web search engine by tagging documents and queries (2009)
Wikitology 1.0 (2007)
• Structured data
  – Specialized concepts (article titles)
  – Generalized concepts (category titles)
  – Inter-category and inter-article links as relations between concepts
  – Article-category links as relations between specialized and generalized concepts
• Unstructured data
  – Article text
• Algorithms to remove useless categories and links, infer categories, and select, rank and aggregate concepts using the hybrid knowledge base
[Diagram: text, graphs, and human input & editing feed the hybrid knowledge base]
Experiments
• Goal: given one or more documents, compute a ranked list of the top Wikipedia articles and/or categories that describe it
• Basic metric: document similarity between a Wikipedia article and the query document(s) (see the sketch below)
• Variations: role of categories, eliminating uninteresting articles, use of spreading activation, using similarity scores, weighting links, number of spreading activation pulses, individual or set of query documents, etc.
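The basic metric can be approximated with off-the-shelf tools. Below is a minimal sketch, assuming scikit-learn and a few placeholder article texts; the actual Wikitology system queries a Lucene index over full Wikipedia article text rather than the tiny in-memory TF-IDF matrix used here.

```python
# Minimal sketch of the basic metric: cosine similarity between a query
# document and Wikipedia article text. Article titles and texts below are
# placeholders, not the real Wikitology index.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

article_titles = ["Crop_rotation", "Permaculture", "Green_manure"]
article_texts = [
    "crop rotation is the practice of growing different crops in sequence",
    "permaculture is an approach to designing sustainable agricultural systems",
    "green manure is a cover crop grown to be ploughed back into the soil",
]

vectorizer = TfidfVectorizer(stop_words="english")
article_matrix = vectorizer.fit_transform(article_texts)

def top_articles(query_doc, k=5):
    """Rank Wikipedia articles by cosine similarity to the query document."""
    query_vec = vectorizer.transform([query_doc])
    scores = cosine_similarity(query_vec, article_matrix)[0]
    return sorted(zip(article_titles, scores), key=lambda p: -p[1])[:k]

print(top_articles("rotating legume cover crops as green manure"))
```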
Method 1: Using Wikipedia article text & categories to predict concepts
[Diagram: input query doc(s) are matched against Wikipedia articles by cosine similarity to obtain similar Wikipedia articles with similarity scores]
[Diagram, continued: the similar articles are connected to the Wikipedia category graph, and categories are ranked by (1) links from the similar articles and (2) cosine similarity to produce the output concepts]
Method 2: Using spreading activation on the category link graph to get aggregated concepts
[Diagram: query doc(s) are matched to Wikipedia articles by cosine similarity; activation then spreads over the Wikipedia category graph, and concepts are ranked by their final activation score]
Input function: $I_j = \sum_i O_i$
Output function: $O_j = A_j / (k \cdot D_j)$
(where $O_i$ is the output of node $i$, $A_j$ the activation of node $j$, $D_j$ its degree, and $k$ a constant)
Method 3: Using spreading activation on the article links graph
[Diagram: query doc(s) are matched to articles in the Wikipedia article links graph; activation spreads over article links, and concepts are ranked by their final activation score]
Threshold: ignore spreading activation to articles with a cosine similarity score below 0.4
Edge weights: cosine similarity between linked articles
Node input function: $I_j = \sum_i O_i \, w_{ij}$
Node output function: $O_j = A_j / k$
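A rough sketch of the spreading-activation step used in Methods 2 and 3 follows. The graph, edge weights, constant k, pulse count, and threshold below are illustrative assumptions, not the exact Wikitology parameters; the node input and output functions mirror the formulas above.

```python
# Sketch of spreading activation over a weighted link graph, in the spirit
# of Methods 2 and 3. All data and parameter values here are illustrative.
def spread_activation(graph, weights, initial, pulses=2, k=2.0, min_input=0.4):
    """graph:   dict node -> list of neighbor nodes
       weights: dict (node, neighbor) -> edge weight (e.g., cosine similarity)
       initial: dict node -> initial activation (document-article similarity)"""
    activation = dict(initial)
    for _ in range(pulses):
        output = {}
        for node, act in activation.items():
            if act < min_input:                   # ignore weakly activated nodes
                continue
            degree = max(len(graph.get(node, [])), 1)
            output[node] = act / (k * degree)     # node output O_j = A_j / (k * D_j)
        new_activation = dict(activation)
        for node, out in output.items():
            for nbr in graph.get(node, []):
                w = weights.get((node, nbr), 1.0)
                # node input I_j = sum_i O_i * w_ij
                new_activation[nbr] = new_activation.get(nbr, 0.0) + out * w
        activation = new_activation
    return sorted(activation.items(), key=lambda p: -p[1])

# toy example
graph = {"Permaculture": ["Sustainable_agriculture", "Organic_farming"],
         "Crop_rotation": ["Sustainable_agriculture"]}
weights = {("Permaculture", "Sustainable_agriculture"): 0.8,
           ("Permaculture", "Organic_farming"): 0.6,
           ("Crop_rotation", "Sustainable_agriculture"): 0.7}
initial = {"Permaculture": 0.8, "Crop_rotation": 0.5}
print(spread_activation(graph, weights, initial))
```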
Evaluation
• An initial informal evaluation compared results against our own judgments
• Used to select promising combinations of ideas and parameter settings
• Formal evaluation:
  – Selected Wikipedia articles for testing; removed them from the Lucene index and graphs
  – For each, used the methods to predict categories and linked articles
  – Compared results, using precision and recall, to the known categories and linked articles
Example: Predictions for a Set of Test Documents
Test document titles (Wikipedia articles): Crop_rotation, Permaculture, Beneficial_insects, Neem, Lady_Bird, Principles_of_Organic_Agriculture, Rhizobia, Biointensive, Intercropping, Green_manure

Method 1 (Ranking Categories Directly): Agriculture, Sustainable_technologies, Crops, Agronomy, Permaculture
Method 2 (2 pulses, Spreading Activation on Category Links Graph): Skills, Applied_sciences, Land_management, Food_industry, Agriculture
Method 3 (2 pulses, Spreading Activation on Article Links Graph): Organic_farming, Sustainable_agriculture, Organic_gardening, Agriculture, Companion_planting
(Some predicted concepts are not in the category hierarchy.)
Category prediction evaluation
• Spreading activation with two pulses worked best
• Only considering articles with similarity > 0.5 was a good threshold

Avg. Sim.  | Precision          | Average Precision               | Recall             | F-Measure
Threshold  | M1    SA1    SA2   | M1(1)  M1(2)  SA1    SA2        | M1    SA1    SA2   | M1    SA1    SA2
0.0        | 0.24  0.30   0.32  | 0.61   0.65   0.60   0.74       | 0.81  0.93   0.97  | 0.38  0.45   0.49
0.1        | 0.25  0.30   0.33  | 0.62   0.65   0.61   0.75       | 0.81  0.93   0.97  | 0.38  0.46   0.49
0.2        | 0.29  0.34   0.37  | 0.66   0.69   0.67   0.78       | 0.85  0.95   0.97  | 0.43  0.50   0.53
0.3        | 0.36  0.43   0.47  | 0.76   0.81   0.77   0.85       | 0.91  0.97   0.99  | 0.51  0.60   0.64
0.4        | 0.42  0.52   0.57  | 0.87   0.92   0.88   0.95       | 0.95  0.98   1.00  | 0.58  0.68   0.73
0.5        | 0.45  0.57   0.62  | 0.91   0.96   0.92   0.98       | 0.94  0.97   1.00  | 0.61  0.72   0.77
0.6        | 0.55  0.63   0.68  | 0.92   1.00   0.97   1.00       | 1.00  1.00   1.00  | 0.71  0.77   0.81
0.7        | 0.55  0.63   0.68  | 0.92   1.00   0.97   1.00       | 1.00  1.00   1.00  | 0.71  0.77   0.81
0.8        | 1.00  1.00   1.00  | 1.00   1.00   1.00   1.00       | 1.00  1.00   1.00  | 1.00  1.00   1.00
Article prediction evaluation
• Spreading activation with one pulse worked best
• Only considering articles with similarity > 0.5 was a good threshold

Avg. Sim. Threshold | Precision | Average Precision | Recall | F-Measure
0.0                 | 0.28      | 0.50              | 0.53   | 0.31
0.1                 | 0.28      | 0.50              | 0.53   | 0.31
0.2                 | 0.32      | 0.56              | 0.58   | 0.35
0.3                 | 0.41      | 0.69              | 0.66   | 0.44
0.4                 | 0.51      | 0.85              | 0.79   | 0.56
0.5                 | 0.59      | 0.94              | 0.88   | 0.67
0.6                 | 0.53      | 0.91              | 0.90   | 0.63
0.7                 | 0.66      | 1.00              | 1.00   | 0.79
0.8                 | 0.67      | 1.00              | 1.00   | 0.80
Improving IR performance (2008-09)
• Improving IR performance for a collection by adding semantic terms to documents
• Query with blind relevance feedback may benefit from the semantic terms
• Initial evaluation with NIST TREC 2005 collection in collaboration with Paul McNamee, JHU HLTCOE
• Ongoing: integration into RiverGlass MORAG search engine
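One simple way to realize "adding semantic terms to documents" is to append the top Wikitology article tags to a document's text before indexing, so that both ordinary retrieval and blind relevance feedback can pick them up. The sketch below assumes a hypothetical wikitology_tags() helper standing in for the actual tagger.

```python
# Sketch: append Wikitology concept terms to a document before indexing.
# wikitology_tags() is a hypothetical stand-in for the real tagger, which
# would query the Wikitology (Lucene) index with the document text.
def wikitology_tags(doc_text, k=10):
    """Hypothetical tagger: return the top-k Wikitology article titles."""
    # Placeholder output in the spirit of the Alan Turing example below.
    return ["Alan_Turing", "Turing_test", "Bombe"][:k]

def expand_document(doc_text, k=10):
    """Return the document text with Wikitology concept terms appended."""
    tags = wikitology_tags(doc_text, k)
    return doc_text + "\n" + " ".join(tags)

print(expand_document("... Alan Turing, described as a brilliant mathematician ..."))
```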
Improving IR performance
... Alan Turing, described as a brilliant mathematician and a key figure in the breaking of the Nazis' Enigma codes. Prof IJ Good says it is as well that British security was unaware of Turing's homosexuality, otherwise he might have been fired 'and we might have lost the war'. In 1950 Turing wrote the seminal paper 'Computing Machinery And Intelligence', but in 1954 killed himself ...
Turing_machine, Turing_test, Church_Turing_thesis, Halting_problem, Computable_number, Bombe, Alan_Turing, Recursion_theory, Formal_methods, Computational_models, Theory_of_computation, Theoretical_computer_science, Artificial_Intelligence
Doc: FT921-4598 (3/9/92)
Evaluation
• Mixed results on NIST evaluation
• Slightly worse on mean average precision
• Slightly better for precision at 10
Run           | MAP    | P@10
base          | 0.2076 | 0.4207
base + rf     | 0.2470 | 0.4480
concepts + rf | 0.2400 | 0.4553
Information Extraction
• Problem: resolve entities found by a named entity recognition system across documents to KB entries
• ACE 2008: the NIST-run Automatic Content Extraction evaluation focused on this task
  – We were part of a team led by the JHU Human Language Technology Center of Excellence
  – Used Wikitology to map document entities to KB entities
Wikitology 2.0 (2008)
[Architecture diagram: article text and graphs plus human input & editing, augmented with RDF data from WordNet and YAGO, databases, and the Freebase KB]
Named Entity Recognition
Timothy F. Geithner, who as president of the New York Federal Reserve Bank oversaw many of the nation’s most powerful financial institutions, stunned the group with the audacity of his answer. He proposed asking Congress to give the president broad power to guarantee all the debt in the banking system, according to two participants, including Michele Davis, then an assistant Treasury secretary.
OpenCalais: a free NER service that returns results in RDF
Global Coreference Task
• Start with entities and relations produced by a within-document extraction system
  – Produce ‘global’ clusters for PERSON and ORGANIZATION entities
  – Only evaluate over instances of entities with a name
• Challenges:
  – Very limited development data
    • ACE released 49 files in English, none in Arabic
    • MITRE released an English ACE05 corpus, but the annotation is noisy and the data has few ambiguous entities
  – Within-document mistakes are propagated to the cross-document system
  – The 10K-document evaluation set required work on the scalability of approaches
Examples:
  William Wallace (living British Lord)
  William Wallace (of Braveheart fame)
  Abu Abbas aka Muhammad Zaydan aka Muhammad Abbas
Global Coreference Resolution Approach
• Serif for intra-document processing
• Entity Filtering
  – Collect all pairs of SERIF entities
  – Filter entity pairs with heuristics (e.g., string similarity of mentions) to get a high-recall set of pairs significantly smaller than the n² possible pairs
• Feature generation
• Training
  – Train an SVM to identify coreferent pairs
• Entity Clustering
  – Cluster predicted pairs
  – Each connected component forms a global entity
• Relation Identification
  – Every pair of SERIF-identified relations whose types are identical and whose endpoints are coreferent is deemed coreferent
Example:
  Document entities:
    E1: Abu Abbas was arrested …
    E2: Palestinian President Mahmoud Abbas ...
    E3: … election of Abu Mazen
    E4: … president George Bush
  Filtered pairs:
    E1, E2 (shared word); E1, E3 (shared word); E2, E3 (known alias)
  Features:
    E1, E2: character overlap: 5; distinct Freebase entities: true
    E1, E3: character overlap: 3; distinct Freebase entities: false
    …
  Entity clusters:
    {Abu Mazen, Mahmoud Abbas, Palestinian Leader}
    {Muhammed Abbas, Abu Abbas, convicted terrorist}
Wikitology tagging
• Using Serif’s output, we produced an entity document for each entity, including the entity’s name, nominal and pronominal mentions, APF type and subtype, and words in a window around the mentions
• We tagged entity documents using Wikitology, producing vectors of (1) terms and (2) categories for the entity
• We used the vectors to compute features measuring entity pair similarity/dissimilarity
Entity Document & Tags
<DOC>
<DOCNO>ABC19980430.1830.0091.LDC2000T44-E2</DOCNO>
<TEXT>
Webb Hubbell
PER
Individual
NAM: "Hubbell” "Hubbells” "Webb Hubbell” "Webb_Hubbell"
NAM: "Mr . " "friend” "income"
PRO: "he” "him” "his"
, . abc's accountant after again ago all alleges alone also and arranged attorney avoid been before being betray but came can cat charges cheating circle clearly close concluded conspiracy cooperate counsel counsel's department did disgrace do dog dollars earned eightynine enough evasion feel financial firm first four friend friends going got grand happening has he help him his hope house hubbell hubbells hundred hush income increase independent indict indicted indictment inner investigating jackie jackie_judd jail jordan judd jury justice kantor ken knew lady late law left lie little make many mickey mid money mr my nineteen nineties ninetyfour not nothing now office other others paying peter_jennings president's pressure pressured probe prosecutors questions reported reveal rock saddened said schemed seen seven since starr statement such tax taxes tell them they thousand time today ultimately vernon washington webb webb_hubbell were what's whether which white whitewater why wife years
</TEXT>
</DOC>
Wikitology article tag vector
Webster_Hubbell 1.000
Hubbell_Trading_Post_National_Historic_Site 0.379
United_States_v._Hubbell 0.377
Hubbell_Center 0.226
Whitewater_controversy 0.222
Wikitology category tag vector
Clinton_administration_controversies 0.204
American_political_scandals 0.204
Living_people 0.201
1949_births 0.167
People_from_Arkansas 0.167
Arkansas_politicians 0.167
American_tax_evaders 0.167
Arkansas_lawyers 0.167
Wikitology derived features
• Seven features measured entity similarity using cosine similarity of various length article or category vectors
• Five features measured entity dissimilarity:
  – two PER entities match different Wikitology persons
  – two entities match Wikitology tags in a disambiguation set
  – two ORG entities match different Wikitology organizations
  – two PER entities match different Wikitology persons, weighted by 1-abs(score1-score2)
  – two ORG entities match different Wikitology orgs, weighted by 1-abs(score1-score2)
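As an illustration of how such features might be computed, the sketch below derives one similarity feature (cosine similarity of two entities' Wikitology tag vectors) and one dissimilarity feature (different top persons, weighted by 1-abs(score1-score2)). The sparse-vector representation and the is_person test are assumptions, not the system's actual data structures.

```python
# Sketch of Wikitology-derived entity-pair features over tag vectors
# represented as dicts (tag -> score). Illustrative only.
import math

def cosine(v1, v2):
    """Cosine similarity of two sparse tag vectors."""
    dot = sum(v1[t] * v2[t] for t in v1 if t in v2)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def different_person_feature(v1, v2, is_person):
    """1 - |score1 - score2| if the top tags are two different persons, else 0."""
    top1, s1 = max(v1.items(), key=lambda p: p[1])
    top2, s2 = max(v2.items(), key=lambda p: p[1])
    if top1 != top2 and is_person(top1) and is_person(top2):
        return 1.0 - abs(s1 - s2)
    return 0.0

# toy vectors, illustrative scores only
tags_e1 = {"Webster_Hubbell": 1.000, "Whitewater_controversy": 0.222}
tags_e2 = {"Webster_Hubbell": 0.913, "United_States_v._Hubbell": 0.377}
print(cosine(tags_e1, tags_e2))
print(different_person_feature(tags_e1, tags_e2, is_person=lambda t: t == "Webster_Hubbell"))
```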
COE Features
• Character-level features
  – Exact match of NAM mentions
    • Longest mention exact match
    • Some mention exact match
    • Multiple mention exact match
    • All mention exact match
  – Partial match
    • Dice score, character bigrams (sketched below)
    • Dice score, longest mention character bigrams
    • Last word of longest string match
  – Matching nominals and pronominals
    • Exact match
    • Multiple exact match
    • All match
    • Dice score of mention strings
• Document-level features
  – Words
    • Dice score, words in document
    • Dice score, words around mentions
    • Cosine score, words in document
    • Cosine score, words around mentions
  – Entities
    • Dice score, entities in document
    • Dice score, entities around mentions
• Metadata features
  – Speech/text
  – News/non-news
  – Same document
  – Social context features
    • Heuristic
    • Probabilistic
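For concreteness, here is a sketch of the "Dice score, character bigrams" partial-match feature: the Dice coefficient over the character-bigram sets of two mention strings. The same kind of cheap string similarity is the sort of heuristic usable for the entity-pair filtering step described earlier.

```python
# Sketch of the Dice score over character bigrams of two mention strings.
def char_bigrams(s):
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice_bigram_score(a, b):
    """Dice = 2|A ∩ B| / (|A| + |B|) over character-bigram sets."""
    A, B = char_bigrams(a), char_bigrams(b)
    if not A and not B:
        return 0.0
    return 2 * len(A & B) / (len(A) + len(B))

print(dice_bigram_score("Mahmoud Abbas", "Abu Abbas"))
```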
More COE Features
• KB features - instances
  – Known alias
    • Also derived aliases from the test collection
  – BBN name match
  – Famous singleton
• KB features - semantic match
  – Entity type match
  – Sex match
  – Number match
  – Occupation match
  – Fuzzy occupation match
  – Nationality match
  – Spouse match
  – Parent match
  – Sibling match
• KB features - ontology
  – Wikitology
    • Top Wikitology category matches
    • Top Wikitology article matches
    • Different top Wikitology person
    • Different top Wikitology organization
    • Top Wikitology categories in disambiguation set
  – Reuters topics
    • Cosine score, words in document
    • Cosine score, words around mentions
  – Thesaurus concepts
    • Cosine score, words in document
    • Cosine score, words around mentions
Clustering
• Approach
  – Assign a score to each entity pair (SVM or heuristic)
  – Eliminate pairs whose score does not exceed a threshold (0.95 for SVM runs)
  – Identify connected components in the resulting graph
• Large clusters
  – AP (good)
  – Clinton (bad; conflates William and Hillary)
  – Sources of large clusters varied
    • Connected components clustering
    • SERIF errors
    • Insufficient features to distinguish separate entities
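A minimal sketch of the clustering step: keep the entity pairs whose score exceeds the threshold (0.95 for the SVM runs) and take each connected component of the resulting graph as a global entity. The union-find implementation and the toy pairs are illustrative.

```python
# Sketch of connected-components clustering over thresholded entity pairs.
def cluster_entities(scored_pairs, threshold=0.95):
    """scored_pairs: iterable of (entity1, entity2, score)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for e1, e2, score in scored_pairs:
        if score > threshold:
            union(e1, e2)

    clusters = {}
    for e in parent:
        clusters.setdefault(find(e), set()).add(e)
    return list(clusters.values())

pairs = [("E1", "E2", 0.97), ("E2", "E3", 0.99), ("E1", "E4", 0.10)]
print(cluster_entities(pairs))   # one cluster {E1, E2, E3}; E4 stays a singleton elsewhere
```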
Features with High F1 scores
• Recall that F1 = 2*P*R/(P+R)
• Variants of exact name match in general, and especially: a name mention in one entity exactly matches one in the other (83.1%)
• Cosine similarity of the vectors of top Wikitology article matches (75.1%)
• Top Wikitology article for the two entities matched (38.1%)
• An entity contained a mention that was a known alias of a mention found in the other (47.5%)
Feature Ablation
A post hoc feature ablation evaluation showed the contribution of the KB features
High Precision Features
• High-precision/low-recall features are useful when applicable
• Features with precision > 95% include:
  – A name mentioned by each entity matches exactly one person in Wikipedia
  – The entities have the same parent
  – The entities have the same spouse
  – All name mentions have an exact match across the two entities
  – The longest named mention has an exact match
Knowledge Base Population
• The 2009 NIST Text Analysis Conference (TAC) will include a new Knowledge Base Population track
• Goal: discover information about named entities (people, organizations, places) and incorporate it into a KB
• TAC KBP has two related tasks:
  – Entity linking: document entity mention -> KB entity
  – Slot filling: given a document entity mention, find missing slot values in a large corpus
KBs and IE are Symbiotic
[Diagram: Knowledge Base <-> Information Extraction from Text; KB info helps interpret text, and IE helps populate KBs]
Planned Extensions
• Make greater use of data from Linked Open Data (LOD) resources: DBpedia, Geonames, Freebase
• Replace ad hoc processing of RDF data in Lucene with a triple store
• Add additional graphs (e.g., derived from infobox links) and develop algorithms to exploit them
• Develop better hybrid query creation tools
Wikitology 3.0 (2009)
[Architecture diagram: Wikitology code mediates between application-specific algorithms and a set of back-end components: an IR collection of articles, page link, category link and infobox graphs, a relational database, a triple store with an RDF reasoner, and linked Semantic Web data & ontologies]
Challenges
• Wikitology tagging is expensive
  – ~3 seconds/document
  – ACE English: ~150K entities (~24 hr on Bluegrit)
  – A spreading activation algorithm on the underlying graphs improves accuracy at even more cost
• Exploit the RDF metadata and data and the underlying graphs
  – requires reasoning and graph processing
• Extract entities from wiki text to find more relations
  – more graph processing
Wikipedia’s social network
• Wikipedia has an implicit ‘social network’ that can help disambiguate PER mentions
• Resolving the PER mentions in a short document to KB people who are linked to one another in the KB is a good heuristic
• The same can be done for the network of ORG and GPE entities
WSN Data
• We extracted 213K people from DBpedia’s Infobox dataset, ~30K of which participate in an infobox link to another person
• We extracted 875K people from Freebase, 616K of which were linked to Wikipedia pages, 431K of which are in one of 4.8M person-person article links
• Consider a document that mentions two people: George Bush and Mr. Quayle
Which Bush & which Quayle?
Six George Bushes; nine male Quayles
A simple closeness metric
Let Si = {two-hop neighbors of node i}
Cij = |intersection(Si, Sj)| / |union(Si, Sj)|
Cij > 0 for six of the 56 possible pairs
0.43 George_H._W._Bush -- Dan_Quayle
0.24 George_W._Bush -- Dan_Quayle
0.18 George_Bush_(biblical_scholar) -- Dan_Quayle
0.02 George_Bush_(biblical_scholar) -- James_C._Quayle
0.02 George_H._W._Bush -- Anthony_Quayle
0.01 George_H._W._Bush -- James_C._Quayle
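A small sketch of this closeness metric: the Jaccard overlap of the two-hop neighborhoods of two nodes in the person-person link graph. The toy graph below is a placeholder, not the extracted Wikipedia social network data.

```python
# Sketch of the closeness metric over a toy person-person link graph.
def two_hop_neighbors(graph, node):
    one_hop = set(graph.get(node, []))
    two_hop = set(one_hop)
    for n in one_hop:
        two_hop |= set(graph.get(n, []))
    two_hop.discard(node)
    return two_hop

def closeness(graph, i, j):
    """C_ij = |S_i ∩ S_j| / |S_i ∪ S_j| over two-hop neighbor sets."""
    si, sj = two_hop_neighbors(graph, i), two_hop_neighbors(graph, j)
    union = si | sj
    return len(si & sj) / len(union) if union else 0.0

graph = {"George_H._W._Bush": ["Dan_Quayle", "Barbara_Bush"],
         "Dan_Quayle": ["George_H._W._Bush", "Marilyn_Quayle"],
         "Anthony_Quayle": ["Laurence_Olivier"]}
print(closeness(graph, "George_H._W._Bush", "Dan_Quayle"))
print(closeness(graph, "George_H._W._Bush", "Anthony_Quayle"))
```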
Application to TAC KBP
• Using entity network data extracted from DBpedia and Wikipedia provides evidence to support KBP tasks:
  – Mapping document mentions into infobox entities
  – Mapping potential slot fillers into infobox entities
  – Evaluating the coherence of entities as potential slot fillers
Next Steps
• Construct a Web-based API and demo system to facilitate experimentation
• Process Wikitology updates in real time
• Exploit machine learning to classify pages and improve performance
• Make better use of the cluster using Hadoop, etc.
• Exploit cell technology for spreading activation and other graph-based algorithms
  – e.g., recognize people by the graph of relations they are part of
DBpedia ontology
• DBpedia 3.2 (Nov 2008) added a manually constructed ontology with
  – 170 classes in a subsumption hierarchy
  – 880K instances
  – 940 properties with domain and range
• A partial, manual mapping was constructed from infobox attributes to these terms
• Current domain and range constraints are “loose”
• Namespace: http://dbpedia.org/ontology/

Instances by class: Place 248,000; Person 214,000; Work 193,000; Species 90,000; Org. 76,000; Building 23,000
Properties by class: Person 56 properties; Organisation 50 properties; Place 110 properties
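To illustrate how the DBpedia ontology namespace can be queried, here is a sketch using the public DBpedia SPARQL endpoint via the SPARQLWrapper library; the particular query (counting dbo:Person instances) is just an example, and counts returned today will differ from the 2008-era figures above.

```python
# Sketch: count instances of the DBpedia ontology class Person via the
# public SPARQL endpoint, assuming the SPARQLWrapper library is installed.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT (COUNT(?p) AS ?count) WHERE { ?p a dbo:Person . }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
print(results["results"]["bindings"][0]["count"]["value"])
```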
Exploiting Linked Data
Conclusion
• Our initial applications show that the Wikitology idea has merit
• Wikipedia is increasingly being used as a knowledge source of choice
• Easily extendable to other wikis and collaborative KBs, e.g., Intellipedia
• Serious use may require exploiting cluster machines and cell processing
• We need to move beyond Wikipedia to exploit the LOD cloud