
Discovering Attribute and Entity Synonyms for Knowledge Integration and Semantic Web Search

Hamid Mousavi, Shi Gao, Carlo Zaniolo

Technical Report #130013
Computer Science Department, UCLA
Los Angeles, USA
[email protected], [email protected], [email protected]

Abstract— There is a growing interest in supporting semantic search on knowledge bases such as DBpedia, YaGo, FreeBase, and other similar systems, which play a key role in many semantic web applications. Although the standard RDF format ⟨subject, attribute, value⟩ is often used by these systems, the sharing of their knowledge is hampered by the fact that various synonyms are frequently used to denote the same entity or attribute; indeed, even an individual system may use alternative synonyms in different contexts, and polynyms also represent a frequent problem. Recognizing such synonyms and polynyms is critical for improving the precision and recall of semantic search. Most previous efforts in this area have focused on entity synonym recognition, whereas attribute synonyms were neglected, and so was the use of context to select the appropriate synonym. For instance, the attribute ‘birthdate’ can be a synonym for ‘born’ when it is used with a value of type ‘date’; but if ‘born’ comes with values which indicate places, then ‘birthplace’ should be considered as its synonym. Thus, the context is critical for finding more specific and accurate synonyms.

In this paper, we propose new techniques to generate context-aware synonyms for the entities and attributes that we use to reconcile knowledge extracted from various sources. To this end, we propose the Context-aware Synonym Suggestion System (CS3), which learns synonyms from text by using our NLP-based text mining framework, called SemScape, and also from existing evidence in the current knowledge bases. Using CS3 and our previously proposed knowledge extraction system IBminer, we integrate some of the publicly available knowledge bases into one of superior quality and coverage, called IKBstore.

I. INTRODUCTION

The importance of knowledge bases in semantic-web applications has motivated the endeavors of several important projects that have created the public-domain knowledge bases shown in Table I. The project described in this paper seeks to integrate and extend these knowledge bases into a more complete and consistent repository named Integrated Knowledge Base Store (IKBstore). IKBstore will provide much better support for advanced web applications, and in particular for user-friendly search systems that support Faceted Search [5] and By-Example Structured Queries [6]. Our approach to achieving this ambitious goal involves four main tasks:

A Integrating existing knowledge bases by converting them into a common internal representation and storing them in IKBstore.

B Completing the integrated knowledge base by extracting more facts from free text.

C Generating a large corpus of context-aware synonyms that can be used to resolve inconsistencies in IKBstore and to improve the robustness of query answering systems.

D Resolving incompleteness in IKBstore by using the synonyms generated in Task C.

At the time of this writing, tasks A and B are completed, and the other tasks are well on their way. The tools developed to perform tasks B, C, and D, and the initial results they produced, will be demonstrated at the VLDB conference in Riva del Garda [20]. The rest of this section and the next section introduce the various intertwined aspects of these tasks, while the remaining sections provide in-depth coverage of the techniques used for synonym generation and the very promising experimental results they produced.

Task A was greatly simplified by the fact that many projects, including DBpedia [7] and YaGo [18], represent the information derived from the structured summaries of Wikipedia (a.k.a. InfoBoxes) by RDF triples of the form ⟨subject, attribute, value⟩, each of which specifies the value of an attribute (property) of a subject. This common representation facilitates the use of these knowledge bases by a roster of semantic-web applications, including queries expressed in SPARQL and user-friendly search interfaces [5], [6]. However, the coverage and consistency provided by each individual system remain limited. To overcome these problems, this project is merging, completing, and integrating these knowledge bases at the semantic level.

Name               Size (MB)   Entities (10^6)   Triples (10^6)
ConceptNet [27]         3075              0.30              1.6
DBpedia [7]            43895              3.77              400
FreeBase [8]           85035               ≈25              585
Geonames [2]            2270               8.3               90
MusicBrainz [3]        17665              18.3             ≈131
NELL [10]               1369              4.34               50
OpenCyc [4]              240              0.24              2.1
YaGo2 [18]             19859              2.64              124

TABLE I
SOME OF THE PUBLICLY AVAILABLE KNOWLEDGE BASES
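As an illustration of this conversion step, the following sketch normalizes triples from two sources into one common ⟨subject, attribute, value⟩ store. The source attribute names and mapping tables are hypothetical, not taken from the actual DBpedia or YaGo vocabularies:

```python
# Minimal sketch of normalizing source-specific RDF triples into a common
# <subject, attribute, value> representation. Mapping tables are illustrative.

def normalize(triple, attr_map):
    """Rewrite a source-specific attribute onto the common vocabulary."""
    s, a, v = triple
    return (s, attr_map.get(a, a), v)

# Hypothetical per-source attribute names for the same property.
yago_map = {"wasBornOnDate": "birthdate"}
dbpedia_map = {"dateOfBirth": "birthdate"}

store = set()
store.add(normalize(("Johann_Sebastian_Bach", "wasBornOnDate", "1685-03-31"), yago_map))
store.add(normalize(("Johann_Sebastian_Bach", "dateOfBirth", "1685-03-31"), dbpedia_map))

# Both sources collapse into a single canonical triple in the store.
```

Real integration must also reconcile subject identifiers across sources, which is where the interlinks discussed below come in.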

Task B mainly completes the initial knowledge base using our knowledge extraction system IBminer [22]. IBminer employs an NLP-based text mining framework, called SemScape, to extract initial triples from the text. Then, using a large body of categorical information and learning from matches between the initial triples and existing InfoBox items in the current knowledge base, IBminer translates the initial triples into more standard InfoBox triples.

The integrated knowledge base so obtained will represent a big step forward, since it will (i) improve the coverage, quality, and consistency of the knowledge available to semantic web applications and (ii) provide a common ground for different contributors to improve the knowledge bases in a more standard and effective way. However, a serious obstacle to achieving such a desirable goal is that different systems do not adhere to a standard terminology to represent their knowledge, and instead use a plethora of synonyms and polynyms.

Thus, we need to resolve synonyms and polynyms for entity names as well as attribute names. For example, by knowing that ‘Johann Sebastian Bach’ and ‘J.S. Bach’ are synonyms, the knowledge base can merge their triples and associate them with a single name. As for polynyms, the problem is even more complex: most of the time, one must decide, based on the context (or popularity), the correct referent of an ambiguous term such as ‘JSB’, which may refer to ‘Johann Sebastian Bach’, ‘Japanese School of Beijing’, etc. Several efforts to find entity synonyms have been reported in recent years [11], [12], [13], [15], [25]. However, the synonym problem for attribute names has received much less attention, although attribute synonyms can play a critical role in query answering. For instance, the attribute ‘birthdate’ can be represented with terms such as ‘date of birth’, ‘wasbornindate’, ‘born’, and ‘DoB’ in different knowledge bases, or even in the same one when used in different contexts. Unless these synonyms are known, a search for musicians born, say, in 1685 is likely to produce a dismal recall.
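A minimal sketch of how a synonym map lets the knowledge base merge triples under a single entity name (the map and triples are illustrative; real entity resolution is far more involved, especially for ambiguous terms):

```python
# Sketch of merging triples under canonical entity names via a synonym map.
# The map and triples below are illustrative assumptions.

entity_synonyms = {
    "J.S. Bach": "Johann Sebastian Bach",
}

triples = [
    ("Johann Sebastian Bach", "occupation", "composer"),
    ("J.S. Bach", "birthdate", "31 March 1685"),
]

def canonicalize(triples, synonyms):
    """Group all (attribute, value) facts under each canonical subject name."""
    merged = {}
    for s, a, v in triples:
        canonical = synonyms.get(s, s)
        merged.setdefault(canonical, []).append((a, v))
    return merged

merged = canonicalize(triples, entity_synonyms)
# All facts now attach to the single name 'Johann Sebastian Bach'.
```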

To address the aforementioned issues, we propose our Context-aware Synonym Suggestion System (CS3 for short). CS3 mainly performs tasks C and D by first extracting context-aware attribute and entity synonyms, and then using them to improve the consistency of IKBstore. CS3 learns attribute synonyms by matching morphological information in free text to the existing structured information. Similar to IBminer, CS3 takes advantage of a large body of categorical information available in Wikipedia, which serves as the contextual information. Then, CS3 improves the attribute synonyms so discovered by using triples with matching subjects and values but different attribute names. After unifying the attribute names in different knowledge bases, CS3 finds subjects with similar attributes and values, as well as similar categorical information, to suggest more entity synonyms. Through this process, CS3 uses several heuristics and takes advantage of currently existing interlinks, such as DBpedia’s alias, redirect, externalLink, or sameAs links, as well as the interlinks provided by other knowledge bases.

In this paper, we describe the following contributions:

• The Context-aware Synonym Suggestion System (CS3), which generates synonyms for both entities and attributes in existing knowledge bases. CS3 uses free text and existing structured data to learn patterns for suggesting attribute synonyms. It also uses several heuristics to improve existing entity synonyms.

• Novel techniques are introduced to integrate several public knowledge bases and convert them into a general knowledge base. To this end, we initially collect the knowledge bases and integrate them by exploiting the subject interlinks they provide. Then, IBminer is used to find more structured information from free text to extend the coverage of the initial knowledge base. At this point, we use CS3 to resolve attribute synonyms in the integrated knowledge base, and to suggest more entity synonyms based on their context similarity and other evidence. This improves the performance of semantic search over our knowledge base, since more standard and specific terms are used for both entities and attributes.

• We implemented our system and performed preliminary experiments on public knowledge bases, namely DBpedia and YaGo, and text from Wikipedia pages. The initial results so obtained are very promising and show that CS3 improves the quality and coverage of the existing knowledge bases by applying synonyms in knowledge integration. The evaluation results also indicate that IKBstore can reach up to 97% accuracy.

The rest of the paper is organized as follows: In the next section, we explain the high-level tasks in IKBstore. Then, in Section III, we briefly discuss the techniques used by IBminer to generate structured information from text. In Section IV, we propose CS3 to learn context-aware synonyms. In Section V, we discuss how these subsystems are used to integrate our knowledge bases. The preliminary results of our approach are presented in Section VI. We discuss some related work in Section VII and conclude in Section VIII.

II. THE BIG PICTURE

As already mentioned, the goal of IKBstore is to integrate the public knowledge bases and create a more consistent and complete knowledge base. IKBstore performs four tasks to achieve this goal. Here we elaborate on these four tasks in more detail:

Task A: Collecting publicly available knowledge bases, unifying the knowledge representation format, and integrating knowledge bases using existing interlinks and structured information. Creating the initial knowledge base is actually a straightforward task (Subsection V-B), since many of the existing knowledge bases represent their knowledge in RDF format. Moreover, they usually provide information to interlink a considerable portion of their subjects to those in DBpedia. Thus, we use such information to create the initial integrated knowledge base. To simplify the discussion, we refer to the initial integrated knowledge base as the initial knowledge base. Although this naive integration may improve the coverage of the initial knowledge base, it still needs considerable improvement in consistency and quality.

Task B: Completing the initial knowledge base using accompanying text. To do so, we employ the IBminer system [22] to generate structured data from the free text available at Wikipedia or similar resources. IBminer first generates semantic links between entity names in the text using our recently proposed text mining framework SemScape. Then, IBminer learns common patterns called Potential Matches (PMs) by matching the current triples in the initial knowledge base to the semantic links derived from free text. It then employs the PMs to extract more InfoBox triples from text. These newly found triples are then merged with the initial knowledge base to improve its coverage. Section III provides more information about this process.

Task C: Generating a large corpus of context-aware synonyms. Since IBminer learns by matching structured data to the morphological structure of the text, it may find more than one acceptable matching attribute name for a given link name from the text. This in fact implies possible attribute synonyms, and it is the main intuition that CS3 uses to learn attribute synonyms. Based on PM, CS3 creates the Potential Attribute Synonyms (PAS) structure, which is similar in nature to PM. However, instead of mapping link names to attribute names, PAS provides mappings between different attribute names based on the categorical information of the subject and the value. As in IBminer, the categorical information serves as the contextual information and improves the quality of the generated attribute synonyms. As described in Section IV-A, CS3 improves PAS by learning from triples with matching subjects and values but different attribute names in the current knowledge base. CS3 also recognizes context-aware entity synonyms by considering the categorical information and InfoBoxes of the entities (Subsection IV-B).
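The idea behind PAS can be sketched as follows: attribute names observed with matching subjects and values are recorded as candidate synonyms of one another, keyed by the categorical context. All category and attribute names below are hypothetical examples, not the actual PAS implementation:

```python
# Sketch of a Potential Attribute Synonyms (PAS) structure: attributes that
# co-occur for the same (subject, value) pair become candidate synonyms,
# keyed by the categorical context of the subject and the value.

from collections import defaultdict

# (subject category, value category) -> attribute -> set of candidate synonyms
pas = defaultdict(lambda: defaultdict(set))

def record_match(cat_s, cat_v, attr_a, attr_b):
    """Record two attributes observed with matching subject and value."""
    pas[(cat_s, cat_v)][attr_a].add(attr_b)
    pas[(cat_s, cat_v)][attr_b].add(attr_a)

record_match("Cat:Person", "Cat:Date", "born", "birthdate")
record_match("Cat:Person", "Cat:City", "born", "birthplace")

# The suggested synonym of 'born' now depends on the value's category:
# a date context yields 'birthdate', a place context yields 'birthplace'.
```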

Task D: Realigning attribute and entity names to construct the final IKBstore. This is indeed the most important step in preparing the knowledge bases for structured queries. Here, we first use PAS to resolve attribute synonyms in the current knowledge base. Then, we use the entity synonyms suggested by CS3 to integrate entity synonyms and their InfoBoxes. This step is covered in Section V.

Applications: IKBstore can benefit a wide variety of applications, since it covers a large number of structured summaries represented with a standard terminology. Knowledge extraction and population systems such as IBminer [22] and OntoMiner [23], knowledge browsing tools such as DBpedia Live [1] and the InfoBox Knowledge-Base Browser (IBKB) [20], and semantic web search systems such as Faceted Search [5] and By-Example Structured Queries [6] are three prominent examples of such applications. In particular, for semantic web search, IKBstore improves the coverage and accuracy of structured queries due to its superior quality and coverage with respect to existing knowledge bases. Moreover, IKBstore can serve as a common ground for different contributors to improve the knowledge bases in a more standard and effective way. Using multiple knowledge bases in IKBstore can also be a good means of verifying the correctness of the current structured summaries, as well as those generated from the text.

III. FROM TEXT TO STRUCTURED DATA

To perform the nontrivial task of generating structured data from text, we use our IBminer system [22]. Although IBminer’s process is quite complex, we can divide it into three high-level steps, which are elaborated in this section. The first step is to parse the sentences in the text and convert them to a more machine-friendly structure called a TextGraph, which contains grammatical and semantic links between entities mentioned in the text. As discussed in Subsection III-A, this step is performed by the NLP-based text mining framework SemScape [21], [22]. The second step is to learn a structure called Potential Matches (PM). As explained in Subsection III-B, PM contains context-aware potential matches between semantic links in the TextGraphs and existing InfoBox items. In the third step, PM is used to suggest the final structured summaries (InfoBoxes) from the semantic links in the TextGraphs. This phase is described in Subsection III-C.

A. From Text to TextGraphs

To generate TextGraphs from text, we employ the SemScape system, which uses morphological information in the text to capture the categorical, semantic, and grammatical relations between words and terms. To understand the general idea, consider the following sentence:

Motivating Sentence: “Johann Sebastian Bach (31 March 1685 - 28 July 1750) was a German composer, organist, harpsichordist, violist, and violinist of the Baroque Period.”

There are several entity names in this sentence (e.g. ‘Johann Sebastian Bach’, ‘31 March 1685’, and ‘German composer’). The first step in SemScape is to recognize these entity names. Thus, SemScape parses the sentence with the Stanford parser [19], a probabilistic parser. Using around 150 tree-based patterns (rules), SemScape finds such entity names and annotates nodes in the parse trees with the possible entity names they contain. These annotations are called MainParts (MPs), and the annotated parse tree is referred to as an MP Tree. With the nodes annotated with their MainParts, other rules do not need to know the underlying structure of the parse trees at each node. As a result, one can provide simpler, fewer, and more general rules to mine the annotated parse trees.
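The MainPart annotation step can be illustrated with a highly simplified sketch. The single rule below (taking all suffixes of an NP’s leaf words as its MainParts) is an assumed stand-in for SemScape’s roughly 150 tree-based rules, and the tree is hand-built rather than Stanford-parser output:

```python
# Toy sketch of MainPart (MP) annotation over a parse tree: noun-phrase
# nodes are annotated with the entity-name candidates they may contain.

class Node:
    def __init__(self, label, children=None, word=None):
        self.label, self.children, self.word = label, children or [], word
        self.main_parts = []

def leaves(node):
    """Collect the leaf words under a node, left to right."""
    if node.word is not None:
        return [node.word]
    return [w for c in node.children for w in leaves(c)]

def annotate_main_parts(node):
    """Toy rule: an NP's MainParts are all suffixes of its leaf word sequence,
    so 'bach' is an MP of the NP 'johann sebastian bach'."""
    for child in node.children:
        annotate_main_parts(child)
    if node.label == "NP":
        words = leaves(node)
        node.main_parts = [" ".join(words[i:]) for i in range(len(words))]

phrase = Node("NP", [Node("NNP", word="johann"),
                     Node("NNP", word="sebastian"),
                     Node("NNP", word="bach")])
annotate_main_parts(phrase)
# phrase.main_parts: ['johann sebastian bach', 'sebastian bach', 'bach']
```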

Fig. 1. Part of the TextGraph for our motivating sentence.

Next, SemScape uses another set of tree-based patterns to find grammatical connections between words and entity names in the parse trees, and combines them into the TextGraph. One such TextGraph for our motivating sentence is shown in Figure 1. Currently, SemScape contains more than 290 manually created rules to perform this step. Each link in the TextGraph is also assigned a confidence value indicating SemScape’s confidence in the correctness of the link. We should point out that TextGraphs support nodes consisting of multiple nodes (and links) through their hyper-links, which mainly differentiates TextGraphs from similar structures such as dependency trees. For more details on the TextGraph generation phase, readers are referred to [22].

Although useful for some applications, the grammatical connections at this stage of the TextGraphs are not enough for IBminer to generate structured data. The reason is that IBminer needs connections between entity names, which we refer to as semantic links. Semantic links simply specify any relation between two entities¹. With this definition, most of the grammatical links in the TextGraphs are not semantic links. To generate more semantic links, IBminer uses a set of manually created graph-based patterns (rules) over the TextGraphs. One such rule is provided below:

——————————– Rule 1 ——————————–
SELECT ( ?1 ?3 ?2 )
WHERE {
    ?1 “subj of” ?3.
    ?2 “obj of” ?3.
    NOT(“not” “prop of” ?3).
    NOT(“no” “det of” ?1).
    NOT(“no” “det of” ?2).
}
————————————————————————–

As depicted in part a) of Figure 2, the pattern graph (the WHERE clause of Rule 1) specifies two nodes with (variable) names ?1 and ?2, which are connected to a third node (?3) with subj of and obj of links, respectively. One possible match for this pattern in the TextGraph of our running example is depicted in part b) of Figure 2. It is worth mentioning that, due to the structure of the TextGraphs, matching multi-word entities to the variable names in the patterns is an easy task for IBminer. This is actually a challenging issue in works such as [30], which are based on dependency parse trees. Using the SELECT clause of Rule 1, the rule returns several triples for our running example, such as:

• ⟨‘johann sebastian bach’, ‘was’, ‘composer’⟩
• ⟨‘sebastian bach’, ‘was’, ‘composer’⟩
• ⟨‘bach’, ‘was’, ‘composer’⟩

The above triples are referred to as the initial triples. In IBminer, we have created 98 graph-based rules to capture semantic links (initial triples) from TextGraphs and add them to the same TextGraph. Some of the semantic links generated for our motivating sentence are shown in Figure 3. For the sake of simplicity, we do not depict grammatical links in this graph. We should restate that all the patterns discussed in this subsection are manually created; readers should not confuse them with the automatically generated patterns that we discuss in the next subsections.

¹SemScape treats values as entities.

Fig. 2. Part a) shows the graph pattern for Rule 1, and part b) depicts one of the possible matches for this pattern.
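The effect of such a graph-based rule can be sketched as follows. The edge list is an assumed TextGraph fragment, and the matching logic is a toy join rather than SemScape’s actual pattern engine:

```python
# Simplified sketch of applying a Rule-1-style graph pattern to a TextGraph:
# edges are (node, label, target); the pattern binds ?1 "subj of" ?3 and
# ?2 "obj of" ?3, emitting <?1, ?3, ?2> as an initial triple.

edges = [
    ("johann sebastian bach", "subj of", "was"),
    ("bach", "subj of", "was"),      # a MainPart variant of the same subject
    ("composer", "obj of", "was"),
]

def apply_rule(edges):
    subjects = [(n, t) for n, lbl, t in edges if lbl == "subj of"]
    objects = [(n, t) for n, lbl, t in edges if lbl == "obj of"]
    # Join on the shared node (?3) to produce the initial triples.
    return [(s, t, o) for s, t in subjects for o, t2 in objects if t == t2]

initial_triples = apply_rule(edges)
# Yields ('johann sebastian bach', 'was', 'composer') and
# ('bach', 'was', 'composer'), mirroring the bullet list above.
```

The negative conditions of Rule 1 (the NOT clauses filtering negated sentences) are omitted here for brevity.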

B. Generating Potential Matches (PM)

To generate the final structured data (new InfoBox triples), IBminer learns an intermediate data structure called Potential Matches (PM) from the mapping between initial triples (semantic links in TextGraphs) and current InfoBox triples. For instance, consider the initial triple ⟨Bach, was, composer⟩ in Figure 3. Also assume that the current InfoBoxes include ⟨Bach, Occupation, composer⟩. This implies a simple pattern saying that the link name ‘was’ may be translated to the attribute name ‘Occupation’, depending on the context of the subject (‘Bach’ in this case) and the value (‘composer’). We refer to these matches as potential matches, and the goal here is to automatically generate and aggregate all potential matches.

To understand why we need to consider the context of the subject and the value, this time consider the two initial triples ⟨Bach, was, composer⟩ and ⟨Bach, was, German⟩ in the TextGraph in Figure 3. Obviously, the link name ‘was’ should be interpreted differently in these two cases, since the former connects a ‘person’ to an ‘occupation’, while the latter is between a ‘person’ and a ‘nationality’. Now, consider two existing InfoBox triples ⟨Bach, occupation, composer⟩ and ⟨Bach, nationality, German⟩, which respectively match the aforementioned initial triples. These two matches imply a simple pattern saying that the link name ‘was’ connecting ‘Bach’ and ‘composer’ should be interpreted as ‘occupation’, while it should be interpreted as ‘nationality’ if it is used between ‘Bach’ and ‘German’. To generalize these implications, instead of using the subject and value names, we use the categories they belong to in our patterns. For instance, knowing that ‘Bach’ is in category ‘Cat:Person’, and that ‘composer’ and ‘German’ are respectively in categories ‘Cat:Occupation in Music’ and ‘Cat:European Nationality’, we learn the following two patterns:

• ⟨Cat:Person, was, Cat:Occupation in Music⟩: occupation
• ⟨Cat:Person, was, Cat:European Nationality⟩: nationality

Here the pattern ⟨c1, l, c2⟩: α indicates that the link named l, connecting a subject in category c1 to an entity or value in category c2, may be interpreted as the attribute name α. Note that for each triple with a matching InfoBox triple, we create several patterns, since the subject and the value usually belong to more than one (direct or indirect) category.

Fig. 3. Some semantic links for our motivating sentence.

More formally, let ⟨s, l, v⟩ be an initial triple in the TextGraph which matches InfoBox triple ⟨s, α, v⟩ in the initial knowledge base. Also, let s and v respectively belong to the category sets Cs = {cs1, cs2, ...} and Cv = {cv1, cv2, ...} according to the categorical information in Wikipedia. Later in this subsection, we discuss a simple yet effective technique for selecting a small set of related categories for a given subject or value. For each cs ∈ Cs and cv ∈ Cv, IBminer creates the following tuple and adds it to PM:

⟨cs, l, cv⟩: α

Each tuple in PM is also associated with a confidence value c (initialized with the confidence of the TextGraph’s semantic link) and an evidence frequency e (initialized to 1). More matches for the same categories and the same link will increase the confidence and evidence count of the above potential match. The above potential match basically means that, for an initial triple ⟨s′, l, v′⟩ in which s′ belongs to category cs and v′ belongs to cv, α may be a match for l with confidence c and evidence e.
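A sketch of how PM entries might be created and updated follows. The text does not specify the exact confidence update rule, so a running average over the observed link confidences is used here as one simple choice; all names and numbers are illustrative:

```python
# Sketch of building Potential Matches (PM): each (cs, l, cv) -> alpha entry
# carries a confidence and an evidence count, updated as matches accumulate.

pm = {}

def add_potential_match(cs, link, cv, alpha, link_confidence):
    key = (cs, link, cv, alpha)
    if key not in pm:
        pm[key] = {"confidence": link_confidence, "evidence": 1}
    else:
        entry = pm[key]
        entry["evidence"] += 1
        # Running average of observed link confidences: one simple choice,
        # since the paper only states that more matches raise confidence.
        entry["confidence"] += (link_confidence - entry["confidence"]) / entry["evidence"]

add_potential_match("Cat:Person", "was", "Cat:Occupation in Music", "occupation", 0.8)
add_potential_match("Cat:Person", "was", "Cat:Occupation in Music", "occupation", 0.6)

entry = pm[("Cat:Person", "was", "Cat:Occupation in Music", "occupation")]
# entry now has evidence 2 and confidence 0.7 (the average of 0.8 and 0.6)
```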

Later in Section IV, we show how CS3 uses the PM structure to build a Potential Attribute Synonyms structure that generates high-quality synonyms.

Selecting Best Categories: A very important issue in generating potential matches is the quality and quantity of the categories for the subjects and values. The direct categories provided for most of the subjects are too specific, and there are only a few subjects in each of them. As shown in Section VI, generating the potential matches over direct categories is not very helpful for generalizing the matches to new subjects. On the other hand, exhaustively adding all the indirect (or ancestor) categories will result in too many inaccurate potential matches. For instance, considering only four levels of categories in Wikipedia’s taxonomy, the subject ‘Johann Sebastian Bach’ belongs to 422 categories. In this list, there are some useful indirect categories, such as ‘German Musicians’ and ‘German Entertainers’, as well as several categories which are either too general or inaccurate (e.g. ‘People by Historical Ethnicity’ and ‘Centuries in Germany’). Considering the same issue for the value part, hundreds of thousands of potential matches may be generated for a single subject. This not only wastes resources, but also impacts the accuracy of the final results.

To address this issue, we use a flow-driven technique to rank all the categories to which subject s belongs, and then select the best NC categories. The main intuition is to propagate flows, or weights, through different paths from s to each category; the categories receiving more weight are considered more related to s. With L being the number of allowed ancestor levels, we create the categorical structure for s up to L levels. Starting with node s as the root of this structure and assigning weight 1.0 to it, we iteratively select the closest unprocessed node to s, propagate its weight to its parent categories, and mark it as processed. To propagate the weight of node ci with k parent categories, we increase the current weight of each of them by wi/k, where wi is the current weight of node ci. Although wi may change even after ci is processed, we will not re-process ci after any further updates to its weight. After propagating the weights of all the nodes, we select the top NC categories for generating potential matches and attribute synonyms.
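The ranking procedure above can be sketched directly; the tiny taxonomy is an assumed example, and a FIFO queue serves as the closest-node-first order since parents always sit one level farther from s:

```python
# Sketch of the flow-driven category ranking: start from the subject with
# weight 1.0, process nodes in increasing distance, split each node's
# current weight equally among its parent categories, never re-process a
# node, and finally take the top-N categories by accumulated weight.

from collections import deque

def rank_categories(subject, parents, top_n):
    weights = {subject: 1.0}
    queue = deque([subject])   # FIFO ~ increasing distance from the subject
    processed = set()
    while queue:
        node = queue.popleft()
        if node in processed:
            continue
        processed.add(node)
        node_parents = parents.get(node, [])
        for p in node_parents:
            # Split the node's current weight equally among its parents.
            weights[p] = weights.get(p, 0.0) + weights[node] / len(node_parents)
            queue.append(p)
    del weights[subject]       # rank only the categories, not the subject
    return sorted(weights, key=weights.get, reverse=True)[:top_n]

# Assumed toy taxonomy: two direct categories sharing one ancestor.
parents = {
    "Bach": ["Cat:German Composers", "Cat:1685 Births"],
    "Cat:German Composers": ["Cat:German Musicians"],
    "Cat:1685 Births": ["Cat:German Musicians"],
}

top = rank_categories("Bach", parents, 2)
# 'Cat:German Musicians' accumulates 0.5 + 0.5 = 1.0 and ranks first.
```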

C. Generating New Structured Summaries

To extract new structured summaries (InfoBox triples), IBminer uses PMs to translate the link names of the initial triples into the attribute names of InfoBoxes. Let t = ⟨s, l, v⟩ be the initial triple whose link l needs to be translated, and let s and v be listed in the category sets Cs = {cs1, cs2, ...} and Cv = {cv1, cv2, ...}, respectively, generated by our category selection algorithm. The key idea for translating l is to take a consensus among all pairs of categories in Cs and Cv, and all possible attributes, to decide which attribute name is a possible match.

To this end, for each cs ∈ Cs and cv ∈ Cv, IBminer finds all potential matches of the form ⟨cs, l, cv⟩: αi. The resulting set of potential matches is then grouped by the attribute names αi, and for each group we compute the average confidence value and the aggregate evidence frequency of the matches. At this point, IBminer uses two thresholds, discussed in Section VI, to discard low-confidence (τc) or low-frequency (τe) potential matches. Next, IBminer filters the remaining results with a very effective type-checking technique2. If one or more matches remain in this list, we pick the one with the largest evidence value, say pmmax, as the only attribute map and report the new InfoBox triple ⟨s, pmmax.a, v⟩ with confidence t.c × pmmax.c and evidence t.e. Secondary possible matches are considered in the next section as attribute synonyms.
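The consensus step can be sketched as follows. This is illustrative only: the `pm` lookup structure and all names are our assumptions, and the type-checking filter is omitted.

```python
from collections import defaultdict

def translate_link(link, cats_s, cats_v, pm, tau_c=0.0, tau_e=0):
    """Consensus translation of an initial-triple link into an InfoBox attribute.

    `pm` maps (cs, link, cv) to a list of (attribute, confidence, evidence)
    potential matches learned beforehand (hypothetical representation).
    """
    groups = defaultdict(list)                      # attribute -> matches
    for cs in cats_s:
        for cv in cats_v:
            for attr, conf, ev in pm.get((cs, link, cv), []):
                groups[attr].append((conf, ev))
    best = None
    for attr, matches in groups.items():
        avg_conf = sum(c for c, _ in matches) / len(matches)
        total_ev = sum(e for _, e in matches)
        if avg_conf < tau_c or total_ev < tau_e:    # drop weak groups
            continue
        if best is None or total_ev > best[2]:      # keep largest evidence
            best = (attr, avg_conf, total_ev)
    return best       # (attribute, avg confidence, evidence) or None
```

The returned attribute would then be combined with the triple's own confidence, as described above; the remaining groups are the secondary matches.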

IV. CONTEXT-AWARE SYNONYMS

Synonyms are terms describing the same concept, which can be used interchangeably. According to this definition, no

2See [22] for details on the type-checking mechanism used by IBminer




matter what context is used, the synonym for a term is fixed (e.g., 'birthdate' and 'date of birth' are always synonyms). However, the meaning or semantics of a term usually depends on the context in which the term is used, and the synonym also varies as the context changes. For instance, in an article describing IT companies, the synonym of the attribute name 'wasCreatedOnDate' is most probably 'founded date'. In this case, knowing that the attribute is used for the name of a company is contextual information that helps us find an appropriate synonym for 'wasCreatedOnDate'. However, if this attribute is used for something else, such as an invention, one cannot use the same synonym for it.

Being aware of the context is even more useful for resolving polynymous phrases, which are in fact much more prevalent than exact synonyms in knowledge bases. For example, consider the entity/subject name 'Johann Sebastian Bach'. Due to its popularity, the general understanding is that the entity describes the famous German classical musician. However, what if we know that for this specific entity the birthplace is 'Berlin'? This simple piece of contextual information leads us to the conclusion that the entity is referring to the painter who was actually the grandson of the famous musician Johann Sebastian Bach. A very similar issue exists for attribute synonyms. For instance, considering the attribute 'born', 'birthdate' can be a synonym for 'born' when it is used with a value of type 'date'; but if 'born' is used with values which indicate places, then 'birthplace' should be considered as its synonym.

CS3 constructs a structure called Potential Attribute Synonyms (PAS) to extract attribute synonyms. In the generation of PAS, CS3 essentially counts the number of times each pair of attributes is used between the same subject and value, and with the same corresponding semantic link in the TextGraphs. The context in this case is the categorical information of the subject and the value. These counts are then used to compute the probability that any two given attributes are synonyms. The next subsection describes the process of generating PAS. Later, in Subsection IV-B, we discuss our approach for suggesting entity synonyms and improving existing ones.

A. Generating Attribute Synonyms

Intuitively, if two attributes (say 'birthdate' and 'dateOfBirth') are synonyms in a specific context, they should be represented by the same (or very similar) semantic links in the TextGraphs (e.g., semantic links such as 'was born on', 'born on', or 'birthdate is'). In simpler words, we use text as the witness for our attribute synonyms. Moreover, the context, which is defined as the categories of the subjects (and of the values), should be very similar for synonymous attributes.

More formally, let attributes αi and αj be two matches for link l in the initial triple ⟨s, l, v⟩. Let Ni,j (= Nj,i) be the total number of times both αi and αj are the interpretation of the same link (in the initial triples) between category sets Cs and Cv. Also, let Nx be the total number of times αx is used between Cs and Cv. Thus, the probability that αi (αj) is a synonym for αj (αi) can be computed as Ni,j/Nj (Ni,j/Ni). Obviously this is not always a symmetric relationship: e.g., the 'born' attribute is always a synonym for 'birthdate', but not the other way around, since 'born' may also refer to 'birthplace' or 'birthname'. In other words, having Ni and Ni,j computed, we can resolve both synonyms and polynyms for any given context (Cs and Cv).
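The asymmetry of this probability can be illustrated with hypothetical counts (the numbers below are invented for the example, not measured values):

```python
def synonym_prob(n_ij, n_j):
    """P(alpha_i is a synonym for alpha_j) = N_ij / N_j (names illustrative)."""
    return n_ij / n_j if n_j else 0.0

# Suppose, between one category pair, 'birthdate' occurs 100 times,
# 'born' occurs 300 times, and they interpret the same link 95 times.
p_born_for_birthdate = synonym_prob(95, 100)   # high: 'born' covers 'birthdate'
p_birthdate_for_born = synonym_prob(95, 300)   # low: 'born' is polysemous
```

Because 'born' also appears with places and names, only a fraction of its uses correspond to 'birthdate', which is exactly what the two ratios capture.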

With the above intuition in mind, the goal of PAS is to compute Ni and Ni,j. Next, we explain how CS3 constructs PAS in a one-pass algorithm, which is essential for scaling up our system. For each two records in PM such as ⟨cs, l, cv⟩: αi and ⟨cs, l, cv⟩: αj, with evidence frequencies ei and ej (ei ≤ ej) respectively, we add the following two records to PAS:

⟨cs, αi, cv⟩: αj
⟨cs, αj, cv⟩: αi

Both records are inserted with the same evidence frequency ei. Note that, if the records are already in the current PAS, we increase their evidence frequency by ei. At the very same time, we also count the number of times each attribute is used between a pair of categories. This is necessary for estimating Ni and computing the final weights of the attribute synonyms. That is, for the case above, we add the following two PAS records as well:

⟨cs, αi, cv⟩: '' (with evidence ei)
⟨cs, αj, cv⟩: '' (with evidence ej)
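The one-pass construction of these PAS records can be sketched as follows, under a simplified, assumed representation of the PM structure (a dict of attribute-to-evidence maps per category pair and link):

```python
from collections import defaultdict
from itertools import combinations

def build_pas(pm_records):
    """One-pass PAS construction (illustrative sketch).

    `pm_records` maps (cs, link, cv) to a dict {attribute: evidence}.
    Returns PAS as a dict keyed by (cs, attr, cv, other_attr); the empty
    string '' as `other_attr` holds the per-attribute usage counts N_i.
    """
    pas = defaultdict(int)
    for (cs, link, cv), attrs in pm_records.items():
        for (ai, ei), (aj, ej) in combinations(attrs.items(), 2):
            e_min = min(ei, ej)
            pas[(cs, ai, cv, aj)] += e_min      # evidence that aj matches ai
            pas[(cs, aj, cv, ai)] += e_min      # and vice versa
        for a, e in attrs.items():
            pas[(cs, a, cv, '')] += e           # usage count toward N_i
    return pas
```

A single scan over the PM records suffices, since both the pairwise co-occurrence counts and the per-attribute totals are accumulated in the same pass.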

Improving PAS with Matching InfoBox Items: Potential attribute synonyms can also be derived from different knowledge bases which contain the same piece of knowledge under different attribute names. For instance, let ⟨J.S.Bach, birthdate, 1685⟩ and ⟨J.S.Bach, wasBornOnDate, 1685⟩ be two InfoBox triples indicating Bach's birthdate. Since the subject and value parts of the two triples match, one may conclude that birthdate and wasBornOnDate are synonyms. To add these types of synonyms to the PAS structure, we follow the exact same idea explained earlier in this section. That is, consider two triples ⟨s, αi, v⟩ and ⟨s, αj, v⟩ in which αi and αj may be synonyms. Also, let s and v belong to the category sets Cs = {cs1, cs2, ...} and Cv = {cv1, cv2, ...} respectively. Then, for all cs ∈ Cs and cv ∈ Cv, we add the following triples to PAS:

⟨cs, αi, cv⟩: αj (with evidence 1)
⟨cs, αj, cv⟩: αi (with evidence 1)

This intuitively means that, in the context (categories) cs to cv, attributes αi and αj may be synonyms. Again, more examples for these categories and attributes increase the evidence, which in turn improves the quality of the final attribute synonyms. Much in the same way as when learning from initial triples, we count the number of times that an attribute is used between every possible pair of categories (cs and cv) to estimate Ni.

Generating Final Attribute Synonyms: Once the PAS structure is built, it is easy to compute attribute synonyms as described earlier. Assume we want to find the best synonyms for attribute αi in InfoBox triple t = ⟨s, αi, v⟩. Using PAS, for




all possible αj, all cs ∈ Cs, and all cv ∈ Cv, we aggregate the evidence frequencies (e) of records of the form ⟨cs, αi, cv⟩: αj in PAS to compute Ni,j. Similarly, we compute Ni by aggregating the evidence frequencies (e) of all records of the form ⟨cs, αi, cv⟩: ''. Finally, we only accept attribute αj as a synonym of αi if Ni,j/Ni and Ni,j are above the predefined thresholds τsc and τse respectively. We study the effect of these thresholds in Section VI.
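Putting the pieces together, the final selection can be sketched as follows (the flat PAS representation and all names are our assumptions for illustration):

```python
from collections import defaultdict

def attribute_synonyms(alpha_i, cats_s, cats_v, pas, tau_sc, tau_se):
    """Select final synonyms for attribute alpha_i (illustrative sketch).

    `pas` maps (cs, attr, cv, other_attr) to an evidence count; the empty
    string '' as `other_attr` stores the usage count of `attr` itself.
    """
    n_ij = defaultdict(int)   # co-occurrence evidence per candidate synonym
    n_i = 0                   # total usage of alpha_i in this context
    for (cs, attr, cv, other), e in pas.items():
        if attr != alpha_i or cs not in cats_s or cv not in cats_v:
            continue
        if other == '':
            n_i += e
        else:
            n_ij[other] += e
    # keep candidates whose ratio N_ij/N_i and raw evidence N_ij pass the thresholds
    return [a for a, e in n_ij.items()
            if n_i and e / n_i >= tau_sc and e >= tau_se]
```

The two thresholds play different roles: τsc filters out polysemous candidates with low conditional probability, while τse discards candidates supported by too little evidence overall.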

B. Generating Entity Synonyms

There are several techniques for finding entity synonyms. Approaches based on string similarity matching [24], manually created synonym dictionaries [28], synonyms automatically generated from click logs [14], [15], and synonyms generated by other data/text mining approaches [29], [23] are only a few examples of such techniques. Although these perform very well at suggesting context-independent synonyms, they do not explicitly consider contextual information for suggesting more appropriate synonyms and resolving polynyms.

Very similarly to context-aware attribute synonyms, in which the context of the subject and value used with an attribute plays a crucial role in the synonyms for that attribute, we can define context-aware entity synonyms. For each entity name, CS3 uses the categorical information of the entity, as well as all the InfoBox triples of the entity, as the contextual information for that entity. Thus, to complement the existing entity synonym suggestion techniques, for any suggested pair of synonymous entities we compute the entities' context similarity to verify the correctness of the suggested synonym.

It is important to understand that this approach should be used as a complementary technique on top of the existing ones, for two main reasons. First, context similarity of two entities does not always imply that they are synonyms, especially when many pieces of knowledge are missing for most entities in the current knowledge bases. Second, it is not feasible to compute the context similarity of all possible pairs of entities, due to the large number of existing entities. In this work, we use the OntoMiner system [23], in addition to simple string matching techniques (e.g., exact string matching, common words, and edit distance), to suggest initial possible synonyms.

Let 'Johann Sebastian Bach' and 'J.S. Bach' be two synonyms that two different knowledge bases use to denote the famous musician. A simple string matching would offer these two entities as synonyms. We then compare their contextual information and realize that they have many common attributes with similar values (e.g., the same values for the attributes occupation, birthdate, birthplace, etc.). Also, they both belong to many common categories (e.g., Cat:German musician, Cat:Composer, Cat:people, etc.). Thus, we suggest them as entity synonyms with high confidence. However, consider the entities 'Johann Sebastian Bach (painter)' and 'J.S. Bach'. Although the initial synonym suggestion technique may suggest them as synonyms, since their contextual information is quite different (i.e., they have different values for the common attributes occupation, birthplace, birthdate, deathplace, etc.), our system does not accept them as synonyms.
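A rough sketch of this context-similarity check follows. The scoring formula here is our own simple stand-in (agreement on shared attributes averaged with Jaccard overlap of categories), not the paper's exact measure, and all names are illustrative:

```python
def context_similarity(triples_a, cats_a, triples_b, cats_b):
    """Score a candidate entity-synonym pair by context overlap (a sketch).

    `triples_*` are sets of (attribute, value) pairs; `cats_*` are category sets.
    """
    common_attrs = {a for a, _ in triples_a} & {a for a, _ in triples_b}
    # fraction of shared attributes whose values agree
    agree = sum(1 for a, v in triples_a
                if a in common_attrs and (a, v) in triples_b)
    attr_score = agree / len(common_attrs) if common_attrs else 0.0
    union = cats_a | cats_b
    cat_score = len(cats_a & cats_b) / len(union) if union else 0.0
    return (attr_score + cat_score) / 2

bach = {('birthplace', 'Eisenach'), ('occupation', 'composer')}
painter = {('birthplace', 'Berlin'), ('occupation', 'painter')}
# Same attributes but conflicting values -> low score, so the pair is rejected.
```

A string-matched candidate pair would only be accepted when this score exceeds a confidence threshold.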

V. COMBINING KNOWLEDGE BASES

The IBminer and CS3 systems allow us to integrate the existing knowledge bases into one of superior quality and coverage. In this section, we elaborate on the steps of building IKBstore introduced previously in Section II.

A. Data Gathering

We are currently in the process of integrating the knowledge bases listed in Table I, which include some domain-specific knowledge bases (e.g., MusicBrainz [3], Geonames [2]) and some domain-independent ones (e.g., DBpedia [7], YaGo2 [18]). Although most knowledge bases, such as DBpedia, YaGo, and FreeBase, already provide their knowledge in RDF, some of them may use other representations. Thus, for all knowledge bases, we convert their knowledge into ⟨Subject, Attribute, Value⟩ triples and store them in IKBstore3. IKBstore is currently implemented over Apache Cassandra, which is designed for handling very large amounts of data. IKBstore recognizes three main types of information:

• InfoBox triples: These triples provide information on a known subject in the ⟨subject, attribute, value⟩ format, e.g., ⟨J.S. Bach, PlaceofBirth, Eisenach⟩, which indicates that the birthplace of the subject J.S.Bach is Eisenach.

• Subject/Category triples: These provide the categories that a subject belongs to, in the form subject/link/category, where link represents a taxonomical relation, e.g., ⟨J.S.Bach, is in, Cat:German Composers⟩, which indicates that the subject J.S.Bach belongs to the category Cat:German Composers.

• Category/Category triples: These represent taxonomical links between categories, e.g., ⟨Cat:German Composers, is in, Cat:German Musicians⟩, which indicates that the category Cat:German Composers is a sub-category of Cat:German Musicians.

Currently, we have converted all the knowledge bases listed in Table I into the above triple formats.
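As a small illustration of converting a relational source into this triple format, a row can be mapped by treating each column name as an attribute connecting the row's main entity to that column's value (the schema and all names below are hypothetical):

```python
def row_to_triples(main_entity, row):
    """Convert one relational row into <subject, attribute, value> triples.

    `row` maps column names to values; every non-null column becomes an
    attribute of `main_entity` (illustrative names only).
    """
    return [(main_entity, col, val) for col, val in row.items() if val is not None]

triples = row_to_triples('J.S. Bach', {'birthplace': 'Eisenach', 'birthyear': 1685})
# -> [('J.S. Bach', 'birthplace', 'Eisenach'), ('J.S. Bach', 'birthyear', 1685)]
```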

IKBstore also preserves the provenance of each piece of knowledge. In other words, for every fact in the integrated knowledge base, we can track its data source. For the facts derived from text, we record the article ID as the provenance. Provenance has several important applications, such as restricting search to specific sources, tracking erroneous pieces of knowledge back to the emanating source, and better ranking techniques based on the reliability of the knowledge in each source. In fact, we also annotate each fact with accuracy confidence and frequency values based on the provenance of the fact. To do so, each knowledge base is assigned a confidence value. To compute the frequency and confidence of individual facts, we respectively count the number of knowledge bases

3For instance, MusicBrainz uses a relational representation, for which we consider every column name, say α, as an attribute connecting the main entity (s) of a row in the table to the value (v) of that row for column α, and create the triple ⟨s, α, v⟩.




including the fact, and combine their confidence values as explained in [22].

B. Initial Knowledge Integration

The aim of this phase is to find the initial interlinks between subjects, attributes, and categories from the various knowledge sources, in order to eliminate duplication, align attributes, and reduce inconsistency using only the existing interlinks. At the end of this phase, we have an initial knowledge base which is not quite ready for structured queries, but provides a better starting point for IBminer to generate more structured data and for CS3 to resolve attribute and entity synonyms.

• Interlinking Subjects: Fortunately, many subjects in different knowledge bases have the same name. Moreover, DBpedia is interlinked with many existing knowledge bases, such as YaGo2 and FreeBase, which can serve as a source of subject interlinks. For the knowledge bases which do not provide such interlinks (e.g., NELL), in addition to exact matching, we parse the structured part of the knowledge base to derive candidate interlinks for existing entities, such as redirect and sameAs links in Wikipedia.

• Interlinking Attributes: As mentioned previously, attribute interlinks are completely ignored in current studies. In this phase, we only use exact matching for interlinking attributes.

• Interlinking Categories: In addition to exact matching, we compute the similarity of the categories in different knowledge bases based on their common instances. Consider two categories c1 and c2, and let S(c) be the set of subjects in category c. The similarity function for category interlinks is defined as Sim(c1, c2) = |S(c1) ∩ S(c2)| / |S(c1) ∪ S(c2)|. If Sim(c1, c2) is greater than a certain threshold, we consider c1 and c2 as aliases of each other, which simply means that if the instances of two categories are highly overlapping, they might be representing the same category.
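This similarity is the Jaccard coefficient over category instances; a minimal sketch (the 0.8 threshold below is illustrative, since the text does not fix a value):

```python
def category_similarity(s_c1, s_c2):
    """Jaccard similarity over category instances: |S1 ∩ S2| / |S1 ∪ S2|."""
    union = s_c1 | s_c2
    return len(s_c1 & s_c2) / len(union) if union else 0.0

def are_aliases(s_c1, s_c2, threshold=0.8):
    """Treat two categories as aliases when their instances overlap enough."""
    return category_similarity(s_c1, s_c2) >= threshold
```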

After retrieving these interlinks, we merge similar entities, categories, and triples based on the retrieved interlinks. The provenance information is generated and stored along with the triples.

C. Final Knowledge Integration

Once the initial knowledge base is ready, we first employ IBminer to extract more structured data from the accompanying text, and then utilize CS3 to resolve synonymous information and create the final knowledge base. More specifically, we perform the following steps to complete and integrate the final knowledge base:

• Improving knowledge base coverage: Web documents contain numerous facts which are ignored by existing knowledge bases. Thus, we first enrich the knowledge base by employing our knowledge extraction system IBminer. As described in Section III, IBminer learns PM from free text and the initial knowledge base to derive more triples, which greatly improves the

coverage of existing knowledge bases. These new triples are then added to IKBstore. For each generated triple, we also update the confidence and evidence frequency in IKBstore; that is, if the triple is already in IKBstore, we only increase and update its confidence and evidence frequency.

• Realigning attribute names: Next, we employ CS3 to learn PAS, generate synonyms for attribute names, and expand the initial knowledge base with more common and standard attribute names.

• Matching entity synonyms: This step merges entities based on the entity synonyms suggested by CS3. For suggested synonym entities such as s1 and s2, we aggregate their triples and use one common entity name, say s1. The other subject (s2) is considered a possible alias for s1, which can be represented by the RDF triple ⟨s1, alias, s2⟩.

• Integrating categorical information: Since we have merged subjects based on entity synonyms, the similarity scores of the categories may change, and thus we need to re-run the category integration described in Subsection V-B.
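The entity-merging step above can be sketched as a toy illustration (the `kb` structure and names are our assumptions):

```python
def merge_entities(kb, s1, s2):
    """Merge entity s2 into s1 after they are suggested as synonyms (a sketch).

    `kb` is a hypothetical dict mapping each subject to a set of
    (attribute, value) pairs.
    """
    kb.setdefault(s1, set()).update(kb.pop(s2, set()))  # aggregate the triples
    kb[s1].add(('alias', s2))                           # keep s2 as an alias of s1
    return kb
```

Keeping the alias triple preserves the alternative name, so queries using either form can still be answered.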

Currently, the knowledge base integration is partially completed for DBpedia and YaGo2, as described in Section VI.

VI. EXPERIMENTAL RESULTS

In this section, we test and evaluate the different steps of creating IKBstore in terms of precision and recall. To this end, we create an initial knowledge base using the subjects listed in Wikipedia for three specific domains (Musicians, Actors, and Institutes). For these subjects, we add their related structured data from DBpedia and YaGo2 to our initial knowledge base. Then, to learn the PM and PAS structures, we use Wikipedia's long abstracts, provided by DBpedia, for the mentioned subjects. We should note that IBminer only uses free text, and thus can take advantage of any other source of textual information.

All the experiments are performed on a single machine with 16 cores at 2.27 GHz and 16 GB of main memory, running Ubuntu 12. On average, SemScape spends 3.07 seconds generating the initial semantic links for each sentence on a single CPU; that is, using 16 cores, we were able to generate initial semantic links for 5.2 sentences per second.

A. Initial Knowledge Base

To evaluate our system, we have created a simple knowledge base which consists of three data sets for the domains Musicians, Actors, and Institutes (e.g., universities and colleges) from the subjects listed in Wikipedia. These data sets do not share any subject, and in total they cover around 7.9% of Wikipedia's subjects. To build each data set, we start from a general category (e.g., Category:Musicians for the Musicians data set) and find all the Wikipedia subjects in this category or any of its descendant categories up to four levels. Table II provides more details on each data set.



Data set     Subjects   InfoBox   Subjects with   Subjects with   Sentences
Name                    Triples   Abstract        InfoBox         per Abstract
Musicians    65835      687184    65476           52339           8.4
Actors       52710      670296    52594           50212           6.2
Institutes   86163      952283    84690           54861           5.9

TABLE II: DESCRIPTION OF THE DATA SETS USED IN OUR EXPERIMENTS.

In our initial knowledge base, we currently use InfoBox triples from DBpedia and YaGo2 for the mentioned subjects. Using multiple knowledge bases improves the coverage of our initial knowledge base; however, in many cases they use different terminologies for representing attribute and entity names. Although our initial knowledge base contains three data sets and we have performed all four tasks on them, we only report the evaluation results for the Musicians data set here. This is mainly due to the time-consuming process of manually grading the results.

To evaluate the quality of the existing knowledge base, we randomly selected 50 subjects (which have both an abstract and an InfoBox) from the Musicians data set. For each subject, we compared the information provided in its abstract and in its InfoBox. For these 50 subjects, 1155 unique InfoBox triples (and 92 duplicate triples) are listed in DBpedia. Interestingly, only 305 of these triples are inferable from the abstract text (only 26.4% of the InfoBoxes are covered by the accompanying text). More importantly, we found 47 wrong triples (4.0%) and 146 triples (12.6%) which are not useful in terms of main knowledge (e.g., triples specifying upload file name, image name, format, file size, or alignment). Thus, an optimistic estimate is that at most 76% of the facts provided in DBpedia are useful for semantics-related applications. We refer to this portion of the knowledge base as useful in this section. Moreover, the accuracy of DBpedia in this case is less than 96%.

B. Completing Knowledge by IBminer

Using the Musicians data set and considering 40 categories (NC = 40) and 4 levels (L = 4), we trained the Potential Match (PM) structure using the IBminer system. The total number of initial triples for this chunk of the data set is |Tn| = 4.72M, while 52.9K of them match existing InfoBoxes (|Tm| = 52.9K). Using PM and the initial triples, we generate the final InfoBox triples without setting any confidence or evidence frequency threshold (i.e., τe = 0 and τc = 0.0). Later in this section, we also analyze the effect of the choices of NC and L on the results.

To estimate the accuracy of the final triples, we randomly select 20K of the generated triples and carefully grade them by matching against their abstracts. Many similar systems, such as [9], [17], [31], have also used manual grading due to the lack of a good benchmark. As for recall, we investigate the existing InfoBoxes and compute what portion of them is also generated by IBminer. This gives only a pessimistic estimate of recall, since we do not know what portion of the InfoBoxes in Wikipedia is covered or mentioned in the

text (the long abstract in our case). To have a better estimate of recall, we only used those InfoBox triples which match at least one of our initial triples. In this way, we estimate based on the InfoBox triples which are most likely mentioned in the text. Nevertheless, this automatic technique provides an acceptable recall value which is sufficient for comparison and tuning purposes.

Best Matches: To simplify our experiments, we combine our two thresholds (τe and τc) by multiplying them together, creating a single threshold called τ (= τe × τc). Part a) of Figure 4 depicts the precision/recall diagram for different values of τ on the Musicians data set. As can be seen, for the first 15% of coverage, IBminer generates only correct information; for these cases τ is very high. According to this diagram, to reach 97% precision, which is higher than DBpedia's precision, one should set τ to 6,300 (≈ .12|Tm|). In this case, as shown in Part c) of the figure, IBminer generates around 96.7K triples with 33.6% recall.

Secondary Matches: For each best match found in the previous step, say t = ⟨s, αi, v⟩, we generate attribute synonyms for αi using PAS. If a reported attribute synonym is not in the list of possible matches for t (see Subsection III-C), we ignore the synonym. Considering τe = .12|Tm|, the precision and recall of the remaining synonyms are computed similarly to the best-match case and depicted in Part b) of Figure 4 (where the potential attribute synonym evidence count threshold (τse) decreases from right to left). As the figure indicates, to reach, for instance, 97% accuracy, one needs to set τse to 12,000 (≈ .23|Tm|). We should add that although the synonym results in this case indicate only 3.6% coverage, the number of correct new triples they generate is 53.6K, which is quite acceptable.

Part c) of Figure 4 summarizes the number of best-match triples, the total number of generated items (both best and secondary matches), and the total number of correct items for several possible values of τ. For instance, for τ = .4|Tm|, we reach up to 97% accuracy while the total number of generated results is more than 150K (96.7K + 53.6K). This indicates a 28.7% improvement in the coverage of the useful part of our initial knowledge base, while the accuracy stays above 97%. This is actually very impressive considering the fact that we only use the abstract of each page.

C. The Impact of Category Selection

To understand the impact of the category selection technique, we repeat the previously mentioned experiment on the Musicians data set for different levels (L) and category numbers (NC). Here, we only compare the results based on the recall estimation. First, we fix NC at 40 and change L from 1 to 4. Part a) of Figure 5 depicts the estimated recall (only for the best matches) as PM's frequency threshold (τe) increases. Not surprisingly, as L increases, the recall improves significantly, since more general categories are considered in the PM construction. On the other hand, using more categories also improves the recall, as depicted in Part b) of the figure. Then we fix L at 4 and change NC from 10 to 50. The results




Fig. 4. Results for the Musicians data set: a) precision/recall diagram for best matches, b) precision/recall diagram for attribute synonyms, and c) the size of generated results for the test data set.

indicate that even with 10 categories, we may reach good recall relative to the other cases. This mainly proves the efficiency of our category selection technique. However, to achieve high recall for lower values of NC, one might need to use very low values of τe, which will result in lower accuracy.

In Part c) of Figure 5, we compare the time performance of all the above cases. This diagram shows the average time spent on generating the final InfoBoxes for each abstract. As we increase the number of levels, the delay increases exponentially, since IBminer performs more database accesses to generate the category structure. However, the delay increases only linearly and insignificantly with the number of categories. This is mainly due to the fact that an increase in the number of categories does not require any extra database access. The delay for generating PM also follows the exact same trend.

D. Completing Knowledge by Attribute Synonyms

In order to compute the precision and recall of the attribute synonyms generated by CS3, we use the Musicians data set and construct the PAS structure as described in Section IV. Using PAS, we generated possible synonyms for 10,000 InfoBox items in our initial knowledge base. Note that these synonyms are for already-existing InfoBoxes, which differentiates them from the secondary matches evaluated in Subsection VI-B. This generated more than 14,900 synonym items. Among the 10,000 InfoBox items, 1263 attribute synonyms were listed, and our technique generated 994 of them. We used these matches to estimate the recall of our technique for different frequency thresholds (τse), as shown in Figure 6. As for the precision estimation, we manually graded the generated synonyms.

As can be seen in Figure 6, CS3 is able to find more than 74% of the possible synonyms with more than 92% accuracy. In fact, this is a very big step toward improving structured query results, since it increases the coverage of IKBstore by at least 88.3%. This in some sense improves the consistency of the knowledge base by providing more synonymous InfoBox triples. In aggregate with the improvement achieved by IBminer, we can state that IKBstore doubles the size of the current knowledge bases while preserving (if not improving) their precision and improving their consistency.

VII. RELATED WORK

There have been several attempts to generate large-scale knowledge bases [4], [10], [27], [8], [7], [2], [3]. However, none of them provides any automatic integration technique among existing knowledge bases. There are only a few recent efforts on integrating knowledge bases, in both domain-specific and general topics. GeoWordNet [16] is an integrated knowledge base of GeoNames [2], WordNet [28], and MultiWordNet [26]. However, the integration is based on manually created concept mappings, which is time-consuming and error-prone for large-scale knowledge base integration. In [18], Yago2 is integrated with GeoNames and structured information in Wikipedia. The category (class) mapping in Yago2 is performed by simple string matching, which is not reliable for large taxonomies such as the one in Wikipedia. Recently, Wu et al. proposed Probase [31], which aims to generate a general taxonomy from web documents. Probase merges the concepts and instances into a large general taxonomy. In its taxonomy integration, the concepts with similar context properties (e.g., derived from the same sentence or with similar child nodes) are merged.

Another line of related studies is synonym suggestion systems, which mostly focus on entity synonyms and ignore attribute synonyms. One of the early works in this area is WordNet [28], which provides manually defined synonyms for general words. Words and very general terms in WordNet are grouped into sets of synonyms called Synsets. Although WordNet

Fig. 6. The precision/recall diagram for the attribute synonyms generated for existing InfoBoxes in the Musicians data set.




Fig. 5. a) The impact of increasing the level number on recall, b) the impact of increasing the number of categories on recall, and c) InfoBox generation delay per abstract.

contains 117,000 high-quality Synsets, it is still limited to general words, misses most multi-word terms, and as a result is not effective for use in large-scale knowledge base integration systems. Another approach to extracting synonyms is based on approximate string matching (fuzzy string searching) [24]. This technique is effective in detecting certain kinds of format-related synonyms, such as normalizations and misspellings. However, it cannot discover semantic synonyms.

To overcome the shortcomings of the aforementioned approaches, more automatic approaches for generating entity synonyms from web-scale documents have been presented in recent years. In [12], [13], Chaudhuri et al. proposed an approach to generate a set of synonyms which are substrings of given entities. This approach exploits the correlation between the substrings and the entities in web documents to identify the candidate synonyms. In [14], [15], Cheng et al. used the query click logs of a web search engine to derive synonyms. First, according to the click logs, every entity is associated with a set of relevant web pages. Then, the queries which result in certain web pages are considered as possible synonyms for the associated entities of those pages. The synonyms are then refined by evaluating the click patterns of the queries. Later, in [11], Chakrabarti et al. refined this approach by defining similarity functions, which solve the click-log sparsity problem and verify that a candidate synonym string is of the same class as the entity. In [29], the author proposed a machine learning algorithm, called PMI-IR, to mine synonyms from web documents. In PMI-IR, if two sets of documents are correlated, the words associated with the two sets are taken as candidate synonyms.

VIII. CONCLUSION

In this paper, we propose the IKBstore technique to integrate currently existing knowledge bases into a more complete and consistent knowledge base. IKBstore first merges existing knowledge bases using the interlinks they provide. Then, it employs the text-based knowledge discovery system IBminer to improve the coverage of the initially integrated knowledge bases by extracting knowledge from free text. Finally, IKBstore utilizes CS3 to extract context-aware attribute and entity synonyms and uses them to reconcile the different terminologies used by different knowledge bases. In this way, IKBstore creates an integrated knowledge base which outperforms existing knowledge bases in terms of coverage, accuracy, and consistency. Our preliminary evaluation supports this claim: it shows that IKBstore doubles the coverage of the existing knowledge bases while slightly improving accuracy and resolving many inconsistencies. The accuracy of the generated knowledge at this point is above 97%.

As for future work, we are currently expanding IKBstore with the goal of covering all the subjects in Wikipedia. We are also working on improving the integration of the categorical and taxonomical information available in different knowledge bases.

REFERENCES

[1] Dbpedia Live. http://live.dbpedia.org/.
[2] Geonames. http://www.geonames.org.
[3] Musicbrainz. http://musicbrainz.org.
[4] Opencyc. http://www.cyc.com/platform/opencyc.
[5] W. Abramowicz and R. Tolksdorf, editors. Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing. Springer, 2010.

[6] M. Atzori and C. Zaniolo. Swipe: searching wikipedia by example. In WWW (Companion Volume), pages 309–312, 2012.

[7] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. Dbpedia - a crystallization point for the web of data. J. Web Sem., 7(3):154–165, 2009.

[8] K. D. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD, 2008.

[9] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. H. Jr., and T. Mitchell. Toward an architecture for never-ending language learning. In AAAI, pages 1306–1313, 2010.

[10] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. H. Jr., and T. M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.

[11] K. Chakrabarti, S. Chaudhuri, T. Cheng, and D. Xin. A framework for robust discovery of entity synonyms. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, pages 1384–1392, New York, NY, USA, 2012. ACM.

[12] S. Chaudhuri, V. Ganti, and D. Xin. Exploiting web search to generate synonyms for entities. In Proceedings of the 18th International Conference on World Wide Web, WWW '09, pages 151–160, New York, NY, USA, 2009. ACM.

[13] S. Chaudhuri, V. Ganti, and D. Xin. Mining document collections to facilitate accurate approximate entity matching. Proc. VLDB Endow., 2(1):395–406, Aug. 2009.

[14] T. Cheng, H. W. Lauw, and S. Paparizos. Fuzzy matching of web queries to structured data. In ICDE, pages 713–716, 2010.

[15] T. Cheng, H. W. Lauw, and S. Paparizos. Entity synonyms for structured web search. IEEE Trans. Knowl. Data Eng., 24(10):1862–1875, 2012.

[16] F. Giunchiglia, V. Maltese, F. Farazi, and B. Dutta. Geowordnet: A resource for geo-spatial applications. In ESWC (1), pages 121–136, 2010.

[17] J. Hoffart, F. M. Suchanek, K. Berberich, E. Lewis-Kelham, G. de Melo, and G. Weikum. Yago2: exploring and querying world knowledge in time, space, context, and many languages. In WWW, pages 229–232, 2011.

[18] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. Yago2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.

[19] D. Klein and C. D. Manning. Fast exact inference with a factored model for natural language parsing. In Advances in Neural Information Processing Systems 15 (NIPS), pages 3–10. MIT Press, 2003.

[20] H. Mousavi, S. Gao, and C. Zaniolo. IBminer: A text mining tool for constructing and populating infobox databases and knowledge bases. VLDB (demo track), 2013.

[21] H. Mousavi, D. Kerr, and M. Iseli. A new framework for textual information mining over parse trees. CRESST Report 805, University of California, Los Angeles, 2011.

[22] H. Mousavi, D. Kerr, M. Iseli, and C. Zaniolo. Deducing infoboxes from unstructured text in wikipedia pages. CSD Technical Report #130001, UCLA, 2013.

[23] H. Mousavi, D. Kerr, M. Iseli, and C. Zaniolo. OntoHarvester: An unsupervised ontology generator from free text. CSD Technical Report #130003, UCLA, 2013.

[24] G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31–88, Mar. 2001.

[25] P. Pantel, E. Crestan, A. Borkovsky, A.-M. Popescu, and V. Vyas. Web-scale distributional similarity and entity set expansion. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, EMNLP '09, pages 938–947, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.

[26] E. Pianta, L. Bentivogli, and C. Girardi. Multiwordnet: developing an aligned multilingual database. In Proceedings of the First International Conference on Global WordNet, 2002.

[27] P. Singh, T. Lin, E. T. Mueller, G. Lim, T. Perkins, and W. L. Zhu. Open Mind Common Sense: Knowledge acquisition from the general public. In Confederated International Conferences DOA, CoopIS and ODBASE, London, UK, 2002.

[28] M. M. Stark and R. F. Riesenfeld. Wordnet: An electronic lexical database. In Proceedings of the 11th Eurographics Workshop on Rendering. MIT Press, 1998.

[29] P. D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the 12th European Conference on Machine Learning, EMCL '01, pages 491–502, London, UK, 2001. Springer-Verlag.

[30] F. Wu and D. S. Weld. Open information extraction using Wikipedia. In ACL, pages 118–127, 2010.

[31] W. Wu, H. Li, H. Wang, and K. Q. Zhu. Probase: a probabilistic taxonomy for text understanding. In SIGMOD Conference, pages 481–492, 2012.
