A Survey of Schema-based Matching Approachesdisi.unitn.it/~p2p/RelatedWork/Matching/JoDS-IV... · A Survey of Schema-based Matching Approaches Pavel Shvaiko1 and J´erˆome Euzenat2

A Survey of Schema-based Matching Approaches�

Pavel Shvaiko1 and Jerome Euzenat2

1 University of Trento, Povo, Trento, Italy,[email protected]

2 INRIA, Rhone-Alpes, France,[email protected]

Abstract. Schema and ontology matching is a critical problem in many appli-cation domains, such as semantic web, schema/ontology integration, data ware-houses, e-commerce, etc. Many different matching solutions have been proposedso far. In this paper we present a new classification of schema-based matchingtechniques that builds on the top of state of the art in both schema and ontologymatching. Some innovations are in introducing new criteria which are based on(i) general properties of matching techniques, (ii) interpretation of input informa-tion, and (iii) the kind of input information. In particular, we distinguish betweenapproximate and exact techniques at schema-level; and syntactic, semantic, andexternal techniques at element- and structure-level. Based on the classificationproposed we overview some of the recent schema/ontology matching systemspointing which part of the solution space they cover. The proposed classificationprovides a common conceptual basis, and, hence, can be used for comparing dif-ferent existing schema/ontology matching techniques and systems as well as fordesigning new ones, taking advantages of state of the art solutions.

1 Introduction

Matching is a critical operation in many application domains, such as semantic web,schema/ontology integration, data warehouses, e-commerce, query mediation, etc. Ittakes as input two schemas/ontologies, each consisting of a set of discrete entities (e.g.,tables, XML elements, classes, properties, rules, predicates), and determines as outputthe relationships (e.g., equivalence, subsumption) holding between these entities.

Many diverse solutions to the matching problem have been proposed so far, e.g.,[2, 5, 21, 23, 40, 46, 49, 50, 52, 57, 76]. Good surveys through the recent years are pro-vided in [39, 62, 75]; while the major contributions of the last decades are presentedin [3, 41, 42, 66]. The survey of [39] focuses on current state of the art in ontologymatching. Authors review recent approaches, techniques and tools. The survey of [75]concentrates on approaches to ontology-based information integration and discussesgeneral matching approaches that are used in information integration systems. How-ever, none of the above mentioned surveys provide a comparative review of the existingontology matching techniques and systems. On the contrary, the survey of [62] is de-voted to a classification of database schema matching approaches and a comparative

� For more information on the topic (e.g., tutorials, relevant events), please visit the OntologyMatching web-site at www.OntologyMatching.org

review of matching systems. Notice that these three surveys address the matching prob-lem from different perspectives (artificial intelligence, information systems, databases)and analyze disjoint sets of systems.

This paper aims at considering the above mentioned works together, taking in to ac-count some novel schema/ontology matching approaches, and at providing a commonconceptual basis for their analysis. Although, there is a difference between schema andontology matching problems (see next section for details), we believe that techniquesdeveloped for each of them can be of a mutual benefit. Thus, we bring together anddiscuss systematically recent approaches and systems developed in schema and ontol-ogy matching domains. We present a new classification of schema/ontology matchingtechniques that builds on the work of [62] augmented in [29, 67] and [28]. Some in-novations are in introducing new criteria which are based on (i) general propertiesof matching techniques, (ii) interpretation of input information, and (iii) the kind ofinput information. In particular, we distinguish between approximate and exact tech-niques at schema-level; and syntactic, external, and semantic techniques at element- andstructure-level. Based on the classification proposed we provide a comparative reviewof the recent schema/ontology matching systems pointing which part of the solutionspace they cover. In this paper we focus only on schema-based solutions, i.e., matchingsystems exploiting only schema-level information, not instance data.

The rest of the paper is organized as follows. Section 2 provides, via an example,the basic motivations and definitions to the schema/ontology matching problem. Sec-tion 3 discusses possible matching dimensions. Section 4 introduces a classification ofelementary automatic schema-based matching techniques and discusses in detail possi-ble alternatives. Section 5 provides a vision on classifying matching systems. Section 6overviews some of the recent schema/ontology matching solutions in light of the classi-fication proposed pointing which part of the solution space they cover. Section 7 reportssome conclusions and discusses the future work.

2 The Matching Problem

2.1 Motivating Example

To motivate the matching problem, let us use two simple XML schemas that are shownin Figure 1 and exemplify one of the possible situations which arise, for example, whenresolving a schema integration task.

Suppose an e-commerce company needs to finalize a corporate acquisition of an-other company. To complete the acquisition we have to integrate databases of the twocompanies. The documents of both companies are stored according to XML schemas Sand S ′ respectively. Numbers in boxes are the unique identifiers of the XML elements.A first step in integrating the schemas is to identify candidates to be merged or to havetaxonomic relationships under an integrated schema. This step refers to a process ofschema matching. For example, the elements with labels Office Products in S and inS′ are the candidates to be merged, while the element with label Digital Cameras in S ′

should be subsumed by the element with label Photo and Cameras in S. Once the cor-respondences between two schemas have been determined, the next step has to generate

Fig. 1. Two XML schemas

query expressions that automatically translate data instances of these schemas under anintegrated schema.

2.2 Schema Matching vs Ontology Matching

Many different kinds of structures can be considered as data/conceptual models: de-scription logic terminologies, relational database schemas, XML-schemas, catalogs anddirectories, entity-relationship models, conceptual graphs, UML diagrams, etc. Most ofthe work on matching has been carried out among (i) database schemas in the world ofinformation integration, (ii) XML-schemas and catalogs on the web, and (iii) ontolo-gies in knowledge representation. Different parties, in general, have their own stead-fast preferences for storing data. Therefore, when coordinating/integrating data amongtheir information sources, it is often the case that we need to match between variousdata/conceptual models they are sticked to. In this respect, there are some importantdifferences and commonalities between schema and ontology matching. The key pointsare:

– Database schemas often do not provide explicit semantics for their data. Seman-tics is usually specified explicitly at design-time, and frequently is not becominga part of a database specification, therefore it is not available [56]. Ontologies arelogical systems that themselves obey some formal semantics, e.g., we can interpretontology definitions as a set of logical axioms.

– Ontologies and schemas are similar in the sense that (i) they both provide a vocab-ulary of terms that describes a domain of interest and (ii) they both constrain themeaning of terms used in the vocabulary [37, 70].

– Schemas and ontologies are found in such environments as the Semantic Web, andquite often in practice, it is the case that we need to match them.

On the one side, schema matching is usually performed with the help of techniquestrying to guess the meaning encoded in the schemas. On the other side, ontology match-ing systems (primarily) try to exploit knowledge explicitly encoded in the ontologies.

In real-world applications, schemas/ontologies usually have both well defined and ob-scure labels (terms), and contexts they occur, therefore, solutions from both problemswould be mutually beneficial.

2.3 Problem Statement

Following the work in [11, 27], we define a mapping element as a 5-uple: 〈id, e, e ′, n, R〉,where

– id is a unique identifier of the given mapping element;– e and e′ are the entities (e.g., tables, XML elements, properties, classes) of the first

and the second schema/ontology respectively;– n is a confidence measure in some mathematical structure (typically in the [0,1]

range) holding for the correspondence between the entities e and e ′;– R is a relation (e.g., equivalence (=); more general (�); disjointness (⊥); overlap-

ping (�)) holding between the entities e and e ′.

An alignment is a set of mapping elements. The matching operation determines thealignment (A′) for a pair of schemas/ontologies (o and o ′). There are some other pa-rameters which can extend the definition of the matching process, namely: (i) the use ofan input alignment (A) which is to be completed by the process; (ii) the matching pa-rameters, p (e.g., weights, thresholds); and (iii) external resources used by the matchingprocess, r (e.g., thesauri); see Figure 2.

o ��A �

o′ �� Matching A′�

p

�

r

�

Fig. 2. The matching process

For example, in Figure 1, according to some matching algorithm based on linguisticand structure analysis, the confidence measure (for the fact that the equivalence relationholds) between entities with labels Photo and Cameras in S and Cameras and Photoin S′ could be 0.67. Suppose that this matching algorithm uses a threshold of 0.55 fordetermining the resulting alignment, i.e., the algorithm considers all the pairs of entitieswith a confidence measure higher than 0.55 as correct mapping elements. Thus, our hy-pothetical matching algorithm should return to the user the following mapping element:〈id5,4, Photo and Cameras, Cameras and Photo, 0.67, =〉. However, the relationbetween the same pair of entities, according to another matching algorithm which isable to determine that both entities mean the same thing, could be exactly the equiva-lence relation (without computing the confidence measure). Thus, returning to the user〈id5,4, Photo and Cameras, Cameras and Photo, n/a,=〉.

2.4 Applications

Matching is an important operation in traditional applications, such as information in-tegration, data warehousing, distributed query processing, etc. Typically, these appli-cations are characterized by structural data/conceptual models, and are based on a de-sign time matching operation, thereby determining alignment (e.g., manually or semi-automatically) as a prerequisite of running the system.

There is an emerging line of applications which can be characterized by their dy-namics (e.g., agents, peer-to-peer systems, web services). Such applications, on thecontrary to traditional ones, require a run time matching operation and take advantageof more ”explicit” conceptual models.

Below, we first discuss an example of a traditional application, namely, catalog inte-gration. Then, we focus on emergent applications, namely, peer-to-peer (P2P) databases,agent communication, and web services integration.

Catalog Integration. In B2B applications, trade partners store their products in elec-tronic catalogs. Catalogs are tree-like structures, namely concept hierarchies with at-tributes. Typical examples of catalogs are product directories of www.amazon.com,www.ebay.com, etc. In order for a private company to participate in the marketplace(e.g., eBay), it is used to determine correspondences between entries of its catalogsand entries of a single catalog of a marketplace. This process of mapping entries amongcatalogs is referred to the catalog matching problem, see [12]. Having identified the cor-respondences between the entries of the catalogs, they are further analyzed in order togenerate query expressions that automatically translate data instances between the cata-logs, see, for example, [74]. Finally, having aligned the catalogs, users of a marketplacehave a unified access to the products which are on sale.

P2P Databases. P2P networks are characterized by an extreme flexibility and dynam-ics. Peers may appear and disappear on the network, their databases are autonomousin their language, contents, how they can change their schemas, and so on. Since peersare autonomous, they might use different terminology, even if they refer to the samedomain of interest. Thus, in order to establish (meaningful) information exchange be-tween peers, one of the steps is to identify and characterize relationships between theirschemas. Having identified the relationships between schemas, next step is to use theserelationships for the purpose of query answering, for example, using techniques appliedin data integration systems, namely Local-as-View (LAV), Global-as-View (GAV), orGlobal-Local-as-View (GLAV) [43]. However, P2P applications pose additional re-quirements on matching algorithms. In P2P settings an assumption that all the peersrely on one global schema, as in data integration, can not be made, because the globalschema may need to be updated any time the system evolves, see [36]. Thus, if in thecase of data integration schema matching operation can be performed at design time, inP2P applications peers need coordinating their databases on the fly, therefore requiringa run time schema matching operation.

Agent Communication. Agents are computer entities characterized by autonomy andcapacity of interaction. They communicate through speech-act inspired languages which

determine the ”envelope” of the messages and enable agents to position them within aparticular interaction context. The actual content of messages is expressed in knowledgerepresentation languages and often refer to some ontology. As a consequence, when twoautonomous and independently designed agents meet, they have the possibility of ex-changing messages, but little chance to understand each others if they do not share thesame content language and ontology. Thus, it is necessary to provide the possibility forthese agents to match their ontologies in order to either translate their messages or in-tegrate bridge axioms in their own models, see [73]. One solution to this problem is tohave an ontology alignment protocol that can be interleaved with any other agent inter-action protocol and which could be triggered upon receiving a message expressed in analien ontology. As a consequence, agents meeting each other for the first time and usingdifferent ontologies would be able to negotiate the matching of terms in their respectiveontologies and to translate the content of the message they exchange with the help ofthe alignment.

Web Services Integration. Web services are processes that expose their interface tothe web so that users can invoke them. Semantic web services provide a richer and moreprecise way to describe the services through the use of knowledge representation lan-guages and ontologies. Web service discovery and integration is the process of finding aweb service able to deliver a particular service and composing several services in orderto achieve a particular goal, see [59]. However, semantic web services descriptions haveno reasons to be expressed by reference to exactly the same ontologies. Henceforth, bothfor finding the adequate service and for interfacing services it will be necessary to es-tablish the correspondences between the terms of the descriptions. This can be providedthrough matching the corresponding ontologies. For instance, if some service providesits output description in some ontology and another service uses a second ontology fordescribing its input, matching both ontologies will be used for (i) checking that what isdelivered by the first service matches what is expected by the second one, (ii) verifyingpreconditions of the second service, and (iii) generating a mediator able to transformthe output of the first service in order to be input to the second one.

In some of the above mentioned applications (e.g., two agents meeting or look-ing for the web services integration) there are no instances given beforehand. Thus, it isnecessary to perform matching without them, based only on schema-level information.

3 The Matching Dimensions

There are many independent dimensions along which algorithms can be classified. Asfrom Figure 2, we may classify them according to (i) input of the algorithms, (ii) char-acteristics of the matching process, and (iii) output of the algorithms. Let us discussthem in turn.

Input dimensions. These dimensions concern the kind of input on which algorithmsoperate. As a first dimension, algorithms can be classified depending on the data / con-ceptual models in which ontologies or schemas are expressed. For example, the Artemis

[13] system supports the relational, OO, and ER models; Cupid [46] supports XML andrelational models; QOM [26] supports RDF and OWL models. A second possible di-mension depends on the kind of data that the algorithms exploit: different approachesexploit different information of the input data/conceptual models, some of them relyonly on schema-level information (e.g., Cupid [46], COMA [21]), others rely only oninstance data (e.g., GLUE [23]), or exploit both, schema- and instance-level informa-tion (e.g., QOM [26]). Even with the same data models, matching systems do not alwaysuse all available constructs, e.g., S-Match [34] when dealing with attributes discards in-formation about datatypes (e.g., string, integer), and uses only the attributes names. Ingeneral, some algorithms focus on the labels assigned to the entities, some considertheir internal structure and the type of their attributes, and some others consider theirrelations with other entities (see next section for details).

Process dimensions. A classification of the matching process could be based on itsgeneral properties, as soon as we restrict ourselves to formal algorithms. In particu-lar, it depends on the approximate or exact nature of its computation. Exact algorithmscompute the absolute solution to a problem; approximate algorithms sacrifice exactnessto performance (e.g., [26]). All the techniques discussed in the remainder of the papercan be either approximate or exact. Another dimension for analyzing the matching algo-rithms is based on the way they interpret the input data. We identify three large classesbased on the intrinsic input, external resources, or some semantic theory of the consid-ered entities. We call these three classes syntactic, external, and semantic respectively;and discuss them in detail in the next section.

Output dimensions. Apart from the information that matching systems exploit andhow they manipulate it, the other important class of dimensions concerns the form ofthe result they produce. The form of the alignment might be of importance: is it a one-to-one correspondence between the schema/ontology entities? Has it to be a final mappingelement? Is any relation suitable?

Other significant distinctions in the output results have been indicated in [32]. Onedimension concerns whether systems deliver a graded answer, e.g., that the correspon-dence holds with 98% confidence or 4/5 probability; or an all-or-nothing answer, e.g.,that the correspondence definitely holds or not. In some approaches correspondencesbetween schema/ontology entities are determined using distance measures. This is usedfor providing an alignment expressing equivalence between these entities in which theactual distance is the ground for generating a confidence measure in each correspon-dence, usually in [0,1] range, see, for example, [29, 46]. Another dimension concernsthe kind of relations between entities a system can provide. Most of the systems focuson equivalence (=), while a few other are able to provide a more expressive result (e.g.,equivalence, subsumption (�), incompatibility (⊥), see for details [12, 33]).

There are many dimensions that can be taken into account when attempting at clas-sifying matching methods. In the next section we present a classification of elementarytechniques that draws simultaneously on several such criteria.

4 A retained classification of elementary schema-based matchingapproaches

In this section we discuss only schema-based elementary matchers. We address issuesof their combination in the next section. Therefore, only schema/ontology informationis considered, not instance data1. The exact/approximate opposition has not been usedbecause each of the methods described below can be implemented as exact or approxi-mate algorithm, depending on the goals of the matching system. To ground and ensurea comprehensive coverage for our classification we have analyzed state of the art ap-proaches used for schema-based matching. The references section reports a partial listof works which have been scrutinized pointing to (some of) the most important contri-butions. We have used the following guidelines for building our classification:

Exhaustivity. The extension of categories dividing a particular category must cover itsextension (i.e., their aggregation should give the complete extension of the cate-gory);

Disjointness. In order to have a proper tree, the categories dividing one category shouldbe pairwise disjoint by construction;

Homogeneity. In addition, the criterion used for further dividing one category shouldbe of the same nature (i.e., should come from the same dimension). This usuallyhelps guaranteeing disjointness;

Saturation. Classes of concrete matching techniques should be as specific and discrim-inative as possible in order to provide a fine grained distinction between possiblealternatives. These classes have been identified following a saturation principle:they have been added/modified till the saturation was reached, namely taking intoaccount new techniques did not require introducing new classes or modifying them.

Notice that disjointness and exhaustivity of the categories ensures stability of the clas-sification, namely new techniques will not occur in between two categories. Classes ofmatching techniques represent the state of the art. Obviously, with appearance of newtechniques, they might be extended and further detailed.

As indicated in introduction, we build on the previous work of classifying automatedschema matching approaches of [62]. The classification of [62] distinguishes betweenelementary (individual) matchers and combinations of matchers. Elementary matcherscomprise instance-based and schema-based, element- and structure-level, linguistic-and constrained-based matching techniques. Also cardinality and auxiliary information(e.g., thesauri, global schemas) can be taken into account.

For classifying elementary schema-based matching techniques, we introduce twosynthetic classifications (see Figure 3), based on what we have found the most salientproperties of the matching dimensions. These two classifications are presented as twotrees sharing their leaves. The leaves represent classes of elementary matching tech-niques and their concrete examples. Two synthetic classifications are:

1 Prominent solutions of instance-based schema/ontology matching as well as possible exten-sions of the instance-based part of the classification of [62] can be found in [23] and [40]respectively.

Schema-Based Matching Techniques

Element-level Structure-level

Syntactic SemanticExternal

String-based

Constraint-based

Graph-based

Taxonomy-based

Linguisticresource

Model-based

- Name similarity- Description similarity- Global

namespaces

- Type similarity- Key properties

- Lexicons- Thesauri

- Graph matching- Paths- Children- Leaves

- Taxonomic structure

- Propositional SAT- DL-based

Language-based

- Tokenization- Lemmatization- Morphological analysis- Elimination

Alignmentreuse

- Entire schema/ ontology- Fragments

Terminological Structural

Syntactic

Linguistic Internal Relational

Semantic

Schema-Based Matching Techniques

Granularity /Input InterpretationLayer

Basic TechniquesLayer

Kind of InputLayer

Upper levelformal

ontologies- SUMO, DOLCE

External

Repository ofstructures

- Structure'smetadata

Fig. 3. A retained classification of elementary schema-based matching approaches

– Granularity/Input Interpretation classification is based on (i) granularity of match,i.e., element- or structure-level, and then (ii) on how the techniques generally inter-pret the input information;

– Kind of Input classification is based on the kind of input which is used by elemen-tary matching techniques.

The overall classification of Figure 3 can be read both in descending (focusing onhow the techniques interpret the input information) and ascending (focusing on the kindof manipulated objects) manner in order to reach the Basic Techniques layer. Let usdiscuss in turn Granularity/Input Interpretation, Basic Techniques, Kind of Input layerstogether with supporting arguments for the categories/classes introduced at each layer.

Elementary matchers are distinguished by the Granularity/Input interpretation layeraccording to the following classification criteria:

– Element-level vs structure-level. Element-level matching techniques compute map-ping elements by analyzing entities in isolation, ignoring their relations with otherentities. Structure-level techniques compute mapping elements by analyzing howentities appear together in a structure. This criterion is the same as first introducedin [62].

– Syntactic vs external vs semantic. The key characteristic of the syntactic techniquesis that they interpret the input in function of its sole structure following someclearly stated algorithm. External are the techniques exploiting auxiliary (exter-nal) resources of a domain and common knowledge in order to interpret the input.These resources might be human input or some thesaurus expressing the relation-ships between terms. The key characteristic of the semantic techniques is that theyuse some formal semantics (e.g., model-theoretic semantics) to interpret the inputand justify their results. In case of a semantic based matching system, exact algo-rithms are complete (i.e., they guarantee a discovery of all the possible mappings)while approximate algorithms tend to be incomplete.

To emphasize the differences with the initial classification of [62], the new cate-gories/classes are marked in bold face. In particular, in the Granularity/Input Interpre-tation layer we detail further (with respect to [62]), the element- and structure-levelof matching by introducing the syntactic vs semantic vs external criteria. The reasonsof having these three categories are as follows. Our initial criterion was to distinguishbetween internal and external techniques. By internal we mean techniques exploitinginformation which comes only with the input schemas/ontologies. External techniquesare as defined above. Internal techniques can be further detailed by distinguishing be-tween syntactic and semantic interpretation of input, also as defined above. However,only limited, the same distinction can be introduced for the external techniques. In fact,we can qualify some oracles (e.g., WordNet [53], DOLCE [31]) as syntactic or seman-tic, but not a user’s input. Thus, we do not detail external techniques any further andwe omit in Figure 3 the theoretical category of internal techniques (as opposed to exter-nal). Notice, that we also omit in further discussions element-level semantic techniques,since semantics is usually given in a structure, and, hence, there are no element-levelsemantic techniques.

Distinctions between classes of elementary matching techniques in the Basic Tech-niques layer of our classification are motivated by the way a matching technique in-terprets the input information in each concrete case. In particular, a label can be in-terpreted as a string (a sequence of letters from an alphabet) or as a word or a phrasein some natural language, a hierarchy can be considered as a graph (a set of nodesrelated by edges) or a taxonomy (a set of concepts having a set-theoretic interpreta-tion organized by a relation which preserves inclusion). Thus, we introduce the follow-ing classes of elementary schema/ontology matching techniques at the element-level:string-based, language-based, based on linguistic resources, constraint-based, align-ment reuse, and based on upper level ontologies. At the structure-level we distinguishbetween: graph-based, taxonomy-based, based on repositories of structures, and model-based techniques.

The Kind of Input layer classification is concerned with the type of input consideredby a particular technique:

– The first level is categorized depending on which kind of data the algorithms workon: strings (terminological), structure (structural) or models (semantics). The twofirst ones are found in the ontology descriptions, the last one requires some seman-tic interpretation of the ontology and usually uses some semantically compliantreasoner to deduce the correspondences.

– The second level of this classification decomposes further these categories if nec-essary: terminological methods can be string-based (considering the terms as se-quences of characters) or based on the interpretation of these terms as linguisticobjects (linguistic). The structural methods category is split into two types of meth-ods: those which consider the internal structure of entities (e.g., attributes and theirtypes) and those which consider the relation of entities with other entities (rela-tional).

Notice that following the above mentioned guidelines for building a classification theterminological category should be divided into linguistic and non-linguistic techniques.However, since non-linguistic techniques are all string-based, this category has beendiscarded.

We discuss below the main classes of the Basic Techniques layer (also indicatingin which matching systems they are exploited) according to the above classification inmore detail. The order follows that of the Granularity/Input Interpretation classificationand these techniques are divided in two sections concerning element-level techniques(§4.1) and structure-level techniques (§4.2). Finally, in Figure 3, techniques which aremarked in italic (techniques based on upper level ontologies and DL-based techniques)have not been implemented in any matching system yet. However, we are arguing whytheir appearance seems reasonable in the near future.

4.1 Element-level techniques

String-based techniques are often used in order to match names and name descrip-tions of schema/ontology entities. These techniques consider strings as sequences ofletters in an alphabet. They are typically based on the following intuition: the moresimilar the strings, the more likely they denote the same concepts. A comparison ofdifferent string matching techniques, from distance like functions to token-based dis-tance functions can be found in [16]. Usually, distance functions map a pair of stringsto a real number, where a smaller value of the real number indicates a greater similaritybetween the strings. Some examples of string-based techniques which are extensivelyused in matching systems are prefix, suffix, edit distance, and n-gram.

– Prefix. This test takes as input two strings and checks whether the first string startswith the second one. Prefix is efficient in matching cognate strings and similaracronyms (e.g., int and integer), see, for example [21, 34, 46, 50]. This test can betransformed in a smoother distance by measuring the relative size of the prefix andthe ratio.

– Suffix. This test takes as input two strings and checks whether the first string endswith the second one (e.g., phone and telephone), see, for example [21, 34, 46, 50].

– Edit distance. This distance takes as input two strings and computes the edit dis-tance between the strings. That is, the number of insertions, deletions, and substitu-tions of characters required to transform one string into another, normalized by thelength of the longest string. For example, the edit distance between NKN and Nikonis 0.4. Some of matching systems exploiting the given technique are [21, 34, 57].

– N-gram. This test takes as input two strings and computes the number of commonn-grams (i.e., sequences of n characters) between them. For example, trigram(3)for the string nikon are nik, iko, kon. Thus, the distance between nkon and nikonwould be 1/3. Some of matching systems exploiting the given test are [21, 34].

Language-based techniques consider names as words in some natural language (e.g.,English). They are based on Natural Language Processing (NLP) techniques exploitingmorphological properties of the input words.

– Tokenization. Names of entities are parsed into sequences of tokens by a tokenizerwhich recognizes punctuation, cases, blank characters, digits, etc. (e.g., Hands-Free Kits → 〈hands, free, kits〉, see, for example [33]).

– Lemmatization. The strings underlying tokens are morphologically analyzed in or-der to find all their possible basic forms (e.g., Kits → Kit), see, for example [33].

– Elimination. The tokens that are articles, prepositions, conjunctions, and so on, aremarked (by some matching algorithms, e.g., [46]) to be discarded.

Usually, the above mentioned techniques are applied to names of entities before run-ning string-based or lexicon-based techniques in order to improve their results. How-ever, we consider these language-based techniques as a separate class of matching tech-niques, since they can be naturally extended, for example, in a distance computation(by comparing the resulting strings or sets of strings).

Constraint-based techniques are algorithms which deal with the internal constraintsbeing applied to the definitions of entities, such as types, cardinality of attributes, andkeys. We omit here a discussion of matching keys as these techniques appear in ourclassification without changes with respect to the original publication [62]. However,we provide a different perspective on matching datatypes and cardinalities.

– Datatypes comparison involves comparing the various attributes of a class withregard to the datatypes of their value. Contrary to objects that require interpreta-tions, the datatypes can be considered objectively and it is possible to determinehow a datatype is close to another (ideally this can be based on the interpretation ofdatatypes as sets of values and the set-theoretic comparison of these datatypes, see[71, 72]). For instance, the datatype day can be considered closer to the datatypeworkingday than the datatype integer. This technique is used in [30].

– Multiplicity comparison attribute values can be collected by a particular construc-tion (set, list, multiset) on which cardinality constraints are applied. Again, it ispossible to compare the so constructed datatypes by comparing (i) the datatypeson which they are constructed and (ii) the cardinality that are applied to them. Forinstance, a set of between 2 and 3 children is closer to a set of 3 people than a setof 10-12 flowers (if children are people). This technique is used in [30].

Linguistic resources such as common knowledge or domain specific thesauri are usedin order to match words (in this case names of schema/ontology entities are consid-ered as words of a natural language) based on linguistic relations between them (e.g.,synonyms, hyponyms).

– Common knowledge thesauri. The approach is to use common knowledge thesaurito obtain meaning of terms used in schemas/ontologies. For example, WordNet [53]is an electronic lexical database for English (and other languages), where varioussenses (possible meanings of a word or expression) of words are put together intosets of synonyms. Relations between schema/ontology entities can be computedin terms of bindings between WordNet senses, see, for instance [12, 33]. For ex-ample, in Figure 1, a sense-based matcher may learn from WordNet (with a priormorphological preprocessing of labels performed) that Camera in S is a hyper-nym for Digital Camera in S ′, and, therefore conclude that entity Digital Camerasin S′ should be subsumed by the entity Photo and Cameras in S. Another type ofmatchers exploiting thesauri is based on their structural properties, e.g., WordNethierarchies. In particular, hierarchy-based matchers measure the distance, for ex-ample, by counting the number of arcs traversed, between two concepts in a givenhierarchy, see [35]. Several other distance measures for thesauri have been proposedin the literature, e.g., [61, 64].

– Domain specific thesauri. These thesauri usually store some specific domain knowl-edge, which is not available in the common knowledge thesauri, (e.g., proper names)as entries with synonym, hypernym and other relations. For example, in Figure 1,entities NKN in S and Nikon in S ′ are treated by a matcher as synonyms from adomain thesaurus look up: syn key - ”NKN:Nikon = syn”, see, for instance [46].

Alignment reuse techniques represent an alternative way of exploiting external re-sources, which contain in this case alignments of previously matched schemas/ontologies.For instance, when we need to match schema/ontology o ′ and o′′, given the alignmentsbetween o and o′, and between o and o′′ from the external resource, storing previousmatch operations results. The alignment reuse is motivated by the intuition that manyschemas/ontologies to be matched are similar to already matched schemas/ontologies,especially if they are describing the same application domain. These techniques are par-ticularly promising when dealing with large schemas/ontologies consisting of hundredsand thousands of entities. In these cases, first, large match problems are decomposedinto smaller sub-problems, thus generating a set of schema/ontology fragments match-ing problems. Then, reusing previous match results can be more effectively applied atthe level of schema/ontology fragments compared to entire schemas/ontologies. Theapproach was first introduced in [62], and later was implemented as two matchers, i.e.,(i) reuse alignments of entire schemas/ontologies, or (ii) their fragments, see, for details[2, 21, 63].

Upper level formal ontologies can be also used as external sources of common knowl-edge. Examples are the Suggested Upper Merged Ontology (SUMO) [55] and Descrip-tive Ontology for Linguistic and Cognitive Engineering (DOLCE) [31]. The key char-acteristic of these ontologies is that they are logic-based systems, and therefore, match-ing techniques exploiting them can be based on the analysis of interpretations. Thus,these are semantic techniques. For the moment, we are not aware of any matching sys-tems which use these kind of techniques. However, it is quite reasonable to assume thatthis will happen in the near future. In fact, for example, the DOLCE ontology aims

at providing a formal specification (axiomatic theory) for the top level part of Word-Net. Therefore, systems exploiting WordNet now in their matching process might alsoconsider using DOLCE as a potential extension.

4.2 Structure-level techniques

Graph-based techniques are graph algorithms which consider the input as labeledgraphs. The applications (e.g., database schemas, taxonomies, or ontologies) are viewedas graph-like structures containing terms and their inter-relationships. Usually, the sim-ilarity comparison between a pair of nodes from the two schemas/ontologies is basedon the analysis of their positions within the graphs. The intuition behind is that, if twonodes from two schemas/ontologies are similar, their neighbors might also be somehowsimilar. Below, we present some particular matchers representing this intuition.

– Graph matching. There have been done a lot of work on graph (tree) matchingin graph theory and also with respect to schema/ontology matching applications,see, for example, [65, 77]. Matching graphs is a combinatorial problem that canbe computationally expensive. It is usually solved by approximate methods. Inschema/ontology matching, the problem is encoded as an optimization problem(finding the graph matching minimizing some distance like the dissimilarity be-tween matched objects) which is further resolved with the help of a graph match-ing algorithm. This optimization problem is solved through a fix-point algorithm(improving gradually an approximate solution until no improvement is made). Ex-amples of such algorithms are [50] and [30]. Some other (particular) matchers han-dling DAGs and trees are children, leaves, and relations.

– Children. The (structural) similarity between inner nodes of the graphs is computedbased on similarity of their children nodes, that is, two non-leaf schema elementsare structurally similar if their immediate children sets are highly similar. A morecomplex version of this matcher is implemented in [21].

– Leaves. The (structural) similarity between inner nodes of the graphs is computedbased on similarity of leaf nodes, that is, two non-leaf schema elements are struc-turally similar if their leaf sets are highly similar, even if their immediate childrenare not, see, for example [21, 46].

– Relations. The similarity computation between nodes can also be based on theirrelations. For example, in one of the possible ontology representations of schemasof Figure 1, if class Photo and Cameras relates to class NKN by relation hasBrandin one ontology, and if class Digital Cameras relates to class Nikon by relation has-Marque in the other ontology, then knowing that classes Photo and Cameras andDigital Cameras are similar, and also relations hasBrand and hasMarque are simi-lar, we can infer that NKN and Nikon may be similar too, see [48].

Taxonomy-based techniques are also graph algorithms which consider only the spe-cialization relation. The intuition behind taxonomic techniques is that is-a links connectterms that are already similar (being a subset or superset of each other), therefore theirneighbors may be also somehow similar. This intuition can be exploited in several dif-ferent ways:

– Bounded path matching. Bounded path matchers take two paths with links betweenclasses defined by the hierarchical relations, compare terms and their positionsalong these paths, and identify similar terms, see, for instance [57]. For example, inFigure 1, given that element Digital Cameras in S ′ should be subsumed by the ele-ment Photo and Cameras in S, a matcher would suggest FJFLM in S and FujiFilmin S′ as an appropriate match.

– Super(sub)-concepts rules. These matchers are based on rules capturing the abovestated intuition. For example, if super-concepts are the same, the actual conceptsare similar to each other. If sub-concepts are the same, the compared concepts arealso similar, see, for example [19, 26].

Repository of structures stores schemas/ontologies and their fragments together withpairwise similarities (e.g., coefficients in the [0 1] range) between them. Notice, thatunlike the alignment reuse, repository of structures stores only similarities betweenschemas/ontologies, not alignments. In the following, to simplify the presentation, wecall schemas/ontologies or their fragments as structures. When new structures are to bematched, they are first checked for similarity to the structures which are already avail-able in the repository. The goal is to identify structures which are sufficiently similar tobe worth matching in more detail, or reusing already existing alignments. Thus, avoid-ing the match operation over the dissimilar structures. Obviously, the determination ofsimilarity between structures should be computationally cheaper than matching themin full detail. The approach of [63], in order to match two structures, proposes to usesome metadata describing these structures, such as structure name, root name, numberof nodes, maximal path length, etc. Then, these indicators are analyzed and are aggre-gated into a single coefficient, which estimates similarity between them. For example,schema S1 might be found as an appropriate match to schema S2 since they both havethe same number of nodes.

Model-based algorithms handle the input based on its semantic interpretation (e.g.,model-theoretic semantics). Thus, they are well grounded deductive methods. Examplesare propositional satisfiability (SAT) and description logics (DL) reasoning techniques.

– Propositional satisfiability (SAT). As from [12, 32, 33], the approach is to decom-pose the graph (tree) matching problem into the set of node matching problems.Then, each node matching problem, namely pairs of nodes with possible rela-tions between them is translated into a propositional formula of form: Axioms →rel(context1, context2), and checked for validity. Axioms encodes backgroundknowledge (e.g., Digital Cameras→ Cameras codifies the fact that Digital Camerasis less general than Cameras), which is used as premises to reason about relationsrel (e.g., =, �, �, ⊥) holding between the nodes context1 and context2 (e.g.,node 7 in S and node 12 in S ′). A propositional formula is valid iff its negation isunsatisfiable. The unsatisfiability is checked by using state of the art SAT solvers.Notice that SAT deciders are correct and complete decision procedures for propo-sitional satisfiability, and therefore, they can be used for an exhaustive check of allthe possible mappings.

– DL-based techniques. The SAT-based approach computes the satisfiability of the-ory merging both schemas/ontologies along an alignment. Propositional languageused for codifying matching problems into propositional unsatisfiability problemsis limited in its expressivity, namely it allows for handling only unary predicates.Thus, it can not handle, for example, binary predicates, such as properties or roles.However, the same procedure can be carried within description logics (expressingproperties). In description logics, the relations (e.g., =, �, �, ⊥) can be expressedin function of subsumption. In fact, first merging two ontologies (after renaming)and then testing each pair of concepts and roles for subsumption is enough foraligning terms with the same interpretation (or with a subset of the interpretationsof the others). For instance, suppose that we have one ontology introducing classescompany, employee and micro-company as a company with at most 5 employees,and another ontology introducing classes firm, associate and SME as a firm with atmost 10 associates. If we know that all associates are employees and we alreadyhave established that firm is equivalent to company, then we can deduce that a micro-company is a SME. However, we are not aware of existence of any schema/ontologymatching systems supporting DL-based techniques for the moment.

There are examples in the literature of DL-based techniques used in relevant toschema/ontology matching applications. For example, in spatio-temporal database in-tegration scenario, as first motivated in [60] and later developed in [68] the inter-schemamapping elements are initially proposed by the integrated schema designer and are en-coded together with input schemas in ALCRP(S 2⊕T ) language. Then, DL reasoningservices are used to check the satisfiability of the two source schemas and the set ofinter-schema mapping elements. If some objects are found unsatisfied, then the inter-schema mapping elements should be reconsidered.

Another example, is when DL-based techniques are used in query processing sce-nario [52]. The approach assumes that mapping elements between pre-existing domainontologies are already specified in a declarative manner (e.g., manually). User queriesare rewritten in terms of pre-existing ontologies and are expressed in Classic [10], andfurther evaluated against real-world repositories, which are also subscribed to the pre-existing ontologies. An earlier approach for query answering by terminological reason-ing is described in [4].

Finally, a very similar problem to schema/ontology matching is addressed withinthe system developed for matchmaking in electronic marketplaces [18]. Demand Dand supply S requests are translated from natural language sentences into Classic [10].The approach assumes the existence of a pre-defined domain ontology T , which is alsoencoded in Classic. Matchmaking between a supply S and a demand D is performedwith respect to the pre-defined domain ontology T . Reasoning is performed with thehelp of the NeoClassic reasoner in order to determine the exact match (T |= (D � S))and (T |= (S � D)), potential match (if D � S is satisfiable in T ), and nearly miss(if D � S is unsatisfiable in T ). The system also provides a logically based matchingresults rank operation.

5 On classifying matching systems

As the previous section indicates, elementary matchers rely on a particular kind of inputinformation, therefore they have different applicability and value with respect to differ-ent schema/ontology matching tasks. State of the art matching systems are not made ofa single elementary matcher. They usually combine them: elementary matchers can beused in sequence (called hybrid matchers in [62]), examples are [5, 46], or in parallel(also called composite matchers [62]) combining the results (e.g., taking the average,maximum) of independently executed matchers, see, for instance [21, 23, 26]. Findinga better classification here is rather difficult.

The distinction between the sequential and parallel composition is useful from anarchitectural perspective. However, it does not show how the systems can be distin-guished in the matter of considering the alignment and the matching task, thus repre-senting an user-centric perspective. Below, we provide a vision of a classification ofmatching systems with respect to this point:

– Alignments as solutions. This category covers purely algorithmic techniques thatconsider that an alignment is a solution to the matching problem. It could be char-acterized as a (continuous or discrete) optimization problem, see, for example [30,50].

– Alignments as theorems. Systems of this category rely on semantics and requirethe alignment to satisfy it. This category, strictly speaking, is a sub-category ofalignments as solutions (the problem is expressed in semantic terms). However, itis sufficiently autonomous for being singled out, see, for example [32, 33].

– Alignments as likeness clues. This category refers to the algorithms which aim atproducing reasonable indications for a user to select the alignment, although usingthe same techniques (e.g., string-based) as systems from the alignments as solutionscategory, see, for example [21, 46].

6 Review of state of the art matching systems

We now look at some recent schema-based state of the art matching systems in light ofthe classification presented in Figure 3 and criteria highlighted in Section 5.

Similarity Flooding. The Similarity Flooding (SF) [50] approach utilizes a hybridmatching algorithm based on the ideas of similarity propagation. Schemas are presentedas directed labeled graphs; grounding on the OIM specification [15] the algorithm ma-nipulates them in an iterative fix-point computation to produce an alignment betweenthe nodes of the input graphs. The technique starts from string-based comparison (com-mon prefix, suffix tests) of the vertices labels to obtain an initial alignment which isrefined within the fix-point computation. The basic concept behind the SF algorithm isthe similarity spreading from similar nodes to the adjacent neighbors through propaga-tion coefficients. From iteration to iteration the spreading depth and a similarity measureare increasing till the fix-point is reached. The result of this step is a refined alignmentwhich is further filtered to finalize the matching process. SF considers the alignment asa solution to a clearly stated optimization problem.

Artemis. Artemis (Analysis of Requirements: Tool Environment for Multiple In-formation Systems) [13] was designed as a module of the MOMIS mediator system [5]for creating global views. It performs affinity-based analysis and hierarchical clusteringof source schema elements. Affinity-based analysis represents the matching step: in ahybrid manner it calculates the name, structural and global affinity coefficients exploit-ing a common thesaurus. The common thesaurus is built with the help of ODB-Tools,WordNet or manual input. It represents a set of intensional and extensional relationshipswhich depict intra- and inter-schema knowledge about classes and attributes of the in-put schemas. Based on global affinity coefficients, a hierarchical clustering techniquecategorizes classes into groups at different levels of affinity. For each cluster it createsa set of global attributes - global class. Logical correspondence between the attributesof a global class and source schema attributes is determined through a mapping table.Artemis falls into the alignments as likeness clues category.

Cupid. Cupid [46] implements a hybrid matching algorithm comprising linguis-tic and structural schema matching techniques, and computes similarity coefficientswith the assistance of a domain specific thesauri. Input schemas are encoded as graphs.Nodes represent schema elements and are traversed in a combined bottom-up and top-down manner. The matching algorithm consists of three phases and operates only withtree-structures to which non-tree cases are reduced. The first phase (linguistic match-ing) computes linguistic similarity coefficients between schema element names (labels)based on morphological normalization, categorization, string-based techniques (com-mon prefix, suffix tests) and a thesauri look-up. The second phase (structural matching)computes structural similarity coefficients weighted by leaves which measure the sim-ilarity between contexts in which elementary schema elements occur. The third phase(mapping elements generation) computes weighted similarity coefficients and generatesfinal alignment by choosing pairs of schema elements with weighted similarity coeffi-cients which are higher than a threshold. Referring to [46], Cupid performs somewhatbetter overall, than the other hybrid matchers: Dike [58] and Artemis [13]. Cupid fallsinto the alignments as likeness clues category.

COMA. COMA (COmbination of MAtching algorithms) [21] is a composite schemamatching tool. It provides an extensible library of matching algorithms; a framework forcombining obtained results, and a platform for the evaluation of the effectiveness of thedifferent matchers. Matching library is extensible, and as from [21] it contains 6 ele-mentary matchers, 5 hybrid matchers, and one reuse-oriented matcher. Most of themimplement string-based techniques (affix, n-gram, edit distance, etc.) as a backgroundidea; others share techniques with Cupid (thesauri look-up, etc.); and reuse-oriented isa completely novel matcher, which tries to reuse previously obtained results for entirenew schemas or for its fragments. Schemas are internally encoded as DAGs, whereelements are the paths. This aims at capturing contexts in which the elements occur.Distinct features of the COMA tool in respect to Cupid, are a more flexible architectureand a possibility of performing iterations in the matching process. Based on the com-parative evaluations conducted in [20], COMA dominates Autoplex[6] and Automatch[7]; LSD [22] and GLUE [23]; SF [50], and SemInt [44] matching tools. COMA fallsinto the alignments as likeness clues category.

NOM. NOM (Naive Ontology Mapping) [26] adopts the idea of composite match-ing from COMA [21]. Some other innovations with respect to COMA, are in the setof elementary matchers based on rules, exploiting explicitly codified knowledge in on-tologies, such as information about super- and sub-concepts, super- and sub-properties,etc. At present the system supports 17 rules. For example, rule (R5) states that if super-concepts are the same, the actual concepts are similar to each other. NOM also exploitsa set of instance-based techniques, this topic is beyond scope of the paper. The systemfalls into the alignments as likeness clues category.

QOM. QOM (Quick Ontology Mapping) [25] is a successor of the NOM system[26]. The approach is based on the idea that the loss of quality in matching algorithmsis marginal (to a standard baseline), however improvement in efficiency can be tremen-dous. This fact allows QOM to produce mapping elements fast, even for large-sizeontologies. QOM is grounded on matching rules of NOM. However, for the purpose ofefficiency the use of some rules have been restricted, e.g., R5. QOM avoids the completepair-wise comparison of trees in favor of a ( n incomplete) top-down strategy. Exper-imental study has shown that QOM is on a par with other state of the art algorithmsconcerning the quality of proposed alignment, while outperforming them with respectto efficiency. Also, QOM shows better quality results than approaches within the samecomplexity class. The system falls into the alignments as likeness clues category.

OLA. OLA (OWL Lite Aligner) [30] is designed with the idea of balancing thecontribution of each component that compose an ontology (classes, properties, names,constraints, taxonomy, and even instances). As such it takes advantage of all the ele-mentary matching techniques that have been considered in the previous sections, but thesemantic ones. OLA is a family of distance based algorithms which converts definitionsof distances based on all the input structures into a set of equations. These distances arealmost linearly aggregated (they are linearly aggregated modulo local matches of enti-ties). The algorithm then looks for the matching between the ontologies that minimizesthe overall distance between them. For that purpose it starts with base distance measurescomputed from labels and concrete datatypes. Then, it iterates a fix-point algorithm un-til no improvement is produced. From that solution, an alignment is generated whichsatisfies some additional criterion (on the alignment obtained and the distance betweenaligned entities). As a system, OLA considers the alignment as a solution to a clearlystated optimization problem.

Anchor-PROMPT. Anchor-PROMPT [57] (an extension of PROMPT, also for-merly known as SMART) is an ontology merging and alignment tool with a sophisti-cated prompt mechanism for possible matching terms. The anchor-PROMPT is a hy-brid alignment algorithm which takes as input two ontologies, (internally representedas graphs) and a set of anchors-pairs of related terms, which are identified with the helpof string-based techniques (edit-distance test), or defined by a user, or another matchercomputing linguistic similarity, for example [49]. Then the algorithm refines them byanalyzing the paths of the input ontologies limited by the anchors in order to deter-mine terms frequently appearing in similar positions on similar paths. Finally, basedon the frequencies and a user feedback, the algorithm determines matching candidates.

Anchor-PROMPT falls into the alignments as solutions and alignments as likeness cluescategories.

S-Match. S-Match [32–34] is a schema-based matching system. It takes two graph-like structures (e.g., XML schemas or ontologies) and returns semantic relations (e.g.,equivalence, subsumption) between the nodes of the graphs that correspond semanti-cally to each other. The relations are determined by analyzing the meaning (concepts,not labels) which is codified in the elements and the structures of schemas/ontologies.In particular, labels at nodes, written in natural language, are translated into proposi-tional formulas which explicitly codify the label’s intended meaning. This allows for atranslation of the matching problem into a propositional unsatisfiability problem, whichcan then be efficiently resolved using (sound and complete) state of the art propositionalsatisfiability deciders. S-Match was designed and developed as a platform for semanticmatching, namely a highly modular system with the core of computing semantic rela-tions where single components can be plugged, unplugged or suitably customized. It isa hybrid system with a composition at the element level. At present, S-Match librariescontain 13 element-level matchers, see [35], and 3 structure-level matchers (e.g., SAT4J[9]). S-Match falls into the alignments as theorems category.

Table 1. Characteristics of state of the art matching approaches

Element-level Structure-levelSyntactic External Syntactic Semantic

string-based (2); iterative fix-pointSF [50] data types; - computation -

key propertiescommon thesaurus (CT): matching

Artemis [13] domain compatibility; synonyms, of neighbors -language-based (1) broader terms, via CT

related termsstring-based (2); auxiliary thesauri: tree matching

Cupid [46] language-based (2); synonyms, weighted by leaves -data types; hypernyms,

key properties abbreviationsstring-based (4); auxiliary thesauri: DAG (tree) matching

language-based (1); synonyms, with a bias towardsCOMA [21] data types hypernyms, leaf or children structures (2); -

abbreviations; pathsalignment reuse (2)

NOM [26] string-based (1); application-specific matching of neighbors (2);FOAM/QOM [25] domains and ranges vocabulary taxonomic structure (4) -

bounded paths matchingAnchor- string-based (1); - (arbitrary links); -

PROMPT [57] domains and ranges bounded paths matching(processing is-a links separately)

string-based (3); iterative fix-pointOLA [30] language-based (1); WordNet(1) computation; -

data types matching of neighbors;taxonomic structure

string-based (5); WordNet:S-Match [33, 34] language-based (3); sense-based (2), - propositional SAT (2)

gloss-based (6)

Table 1 summarizes how the matching systems cover the solution space in terms ofthe proposed classification. Numbers in brackets specify how many matchers of a par-ticular type a system supports. For example, S-Match supports 5 string-based element-

level syntactic matchers (prefix, suffix, edit distance, n-gram, and text corpus, see [34]),OLA has one element-level external matcher based on WordNet. Table 1 also testifiesthat schema/ontology matching research was mainly focused on syntactic and externaltechniques so far. Semantic techniques have been exploited only by S-Match [33].

Having considered some of the recent schema-based matching systems, it is im-portant to notice that the matching operation typically constitutes only one of the stepstowards the ultimate goal of, e.g., schema/ontology integration, web services integrationor meta data management. To this end, we would like to mention some existing infras-tructures, which use matching as one of its integral components. Some examples areChimaera [49], OntoMerge [24], Rondo [51], MAFRA [47], Protoplasm [8]. The goalof such infrastructures is to enable a user with a possibility of performing such high-level tasks, e.g., given a product request expressed in terms of the catalog C1, returnthe products satisfying the request from the marketplaces MP1 and MP2. Moreover,use matching component M5, and translate instances by using component T 3.

7 Conclusions

This paper presents a new classification of schema-based matching approaches, whichimproves the previous work on the topic. We have introduced new criteria which arebased on (i) general properties of matching techniques, i.e., we distinguish betweenapproximate and exact techniques; (ii) interpretation of input information, i.e., we dis-tinguish between syntactic, external, and semantic techniques at element- and structure-level; and (iii) the kind of input information, i.e., we distinguish between termino-logical, structural, and semantic techniques. We have reviewed some of the recentschema/ontology matching systems in light of the classification proposed pointing whichpart of the solution space they cover. Analysis of state of the art systems discussed hasshown, that most of them exploit only syntactic and external techniques, following theinput interpretation classification; or terminological and structural techniques, follow-ing the kind of input classification; and only one uses semantic techniques, followingboth classifications. However, the category of semantic techniques was identified onlyrecently as a part of the solution space; its methods provide sound and complete results,and, hence it represents a promising area for the future investigations.

The proposed classification provides a common conceptual basis, and, hence, can beused for comparing (analytically) different existing schema/ontology matching systemsas well as for designing a new one, taking advantages of state of the art solutions. Asthe paper shows, the solution space is quite large and there exists a variety of matchingtechniques. In some cases it is difficult to draw conclusions from the classifications ofsystems. A complementary approach is to compare matching systems experimentally,with the help of benchmarks which measure the quality of the alignment (e.g., comput-ing precision, recall, overall indicators) and the performance of systems (e.g., measuringexecution time, main memory indicators). We have started working on such an approachand we have found useful our classifications for designing systematic benchmarks, e.g.,by discarding features (one by one) from schemas/ontologies with respect to the classi-fications we had (namely, what class of basic techniques deals with what feature), see

for preliminary results the I3CON initiative2, Ontology Alignment Contest3 [69], andOntology Alignment Evaluation Initiative4.

Future work proceeds in at least three directions. The first direction aims at takinginto account some novel matching approaches which exploit schema-level information,e.g., [1, 14, 38, 45, 54]. As it has already been mentioned, in some applications (e.g.,agents communication, web services integration) there are no instances given before-hand, and therefore, schema-based matching is an only solution for such cases. How-ever, in the other applications (e.g., schema/ontology integration), instances are givenbeforehand, and therefore, instance-based approaches should be considered as a partof the solution space. Thus, the second direction of the future work aims at extendingour classification by taking into account instance-based approaches, e.g., [17, 23, 40].Finally, the third direction aims at conducting an extensive evaluation of matching sys-tems by systematic benchmarks and by case studies on the industrial size problems.

Acknowledgements: This work has been partly supported by the Knowledge WebEuropean network of excellence (IST-2004-507482).

References

1. Z. Aleksovski, W. ten Kate, and F. van Harmelen. Semantic coordination: a new approxima-tion method and its application in the music domain. In Proceedings of the Meaning Coor-dination and Negotiation workshop at the International Semantic Web Conference (ISWC),2004.

2. D. Aumuller, H. H. Do, S. Massmann, and E. Rahm. Schema and ontology matching withCOMA++. In Proceedings of the International Conference on Management of Data (SIG-MOD), Software Demonstration, 2005.

3. C. Batini, M. Lenzerini, and S. B. Navathe. A comparative analysis of methodologies fordatabase schema integration. ACM Computing Surveys, 18(4):323–364, 1986.

4. H. W. Beck, S. K. Gala, and S. B. Navathe. Classification as a query processing techniquein the CANDIDE semantic data model. In Proceedings of the International Conference onData Engineering (ICDE), pages 572–581, 1989.

5. S. Bergamaschi, S. Castano, and M. Vincini. Semantic integration of semistructured andstructured data sources. SIGMOD Record, (28(1)):54–59, 1999.

6. J. Berlin and A. Motro. Autoplex: Automated discovery of content for virtual databases. InProceedings of the International Conference on Cooperative Information Systems (CoopIS),pages 108–122, 2001.

7. J. Berlin and A. Motro. Database schema matching using machine learning with featureselection. In Proceedings of the International Conference on Advanced Information SystemsEngineering (CAiSE), pages 452–466, 2002.

8. P. Bernstein, S. Melnik, M. Petropoulos, and C. Quix. Industrial-strength schema matching.SIGMOD Record, (33(4)):38–43, 2004.

9. D. Le Berre. A satisfiability library for Java. http://www.sat4j.org/.10. A. Borgida, R. Brachman, D. McGuinness, and L. Resnick. CLASSIC: A structural data

model for objects. SIGMOD Record, 18(2):58–67, 1989.

2 http://www.atl.external.lmco.com/projects/ontology/i3con.html3 http://oaei.inrialpes.fr/2004/Contest/4 http://oaei.inrialpes.fr/2005/

11. P. Bouquet, J. Euzenat, E. Franconi, L. Serafini, G. Stamou, and S. Tessaris. D2.2.1: Specifi-cation of a common framework for characterizing alignment. Technical report, NoE Knowl-edge Web project delivable, 2004. http://knowledgeweb.semanticweb.org/.

12. P. Bouquet, L. Serafini, and S. Zanobini. Semantic coordination: A new approach and anapplication. In Proceedings of the International Semantic Web Conference (ISWC), pages130–145, 2003.

13. S. Castano, V. De Antonellis, and S. De Capitani di Vimercati. Global viewing of heteroge-neous data sources. IEEE Transactions on Knowledge and Data Engineering, (13(2)):277–297, 2001.

14. S. Castano, A. Ferrara, S. Montanelli, and G. Racca. Semantic information interoperabilityin open networked systems. In Proceedings of the International Conference on Semantics ofa Networked World (ICSNW), in cooperation with ACM SIGMOD, pages 215–230, 2004.

15. Meta Data Coalition. Open information model, version 1.0. http://mdcinfo/oim/oim10.html,August 1999.

16. W. Cohen, P. Ravikumar, and S. Fienberg. A comparison of string metrics for matchingnames and records. In Proceedings of the workshop on Data Cleaning and Object Consol-idation at the International Conference on Knowledge Discovery and Data Mining (KDD),2003.

17. R. Dhamankar, Y. Lee, A. Doan, A. Halevy, and P. Domingos. iMAP: Discovering complexsemantic matches between database schemas. In Proceedings of the International Confer-ence on Management of Data (SIGMOD), pages 383–394, 2004.

18. T. Di Noia, E. Di Sciascio, F. M. Donini, and M. Mongiello. A system for principled match-making in an electronic marketplace. In Proceedings of the World Wide Web Conference(WWW), pages 321–330, 2003.

19. R. Dieng and S. Hug. Comparison of ”personal ontologies” represented through conceptualgraphs. In Proceedings of the European Conference on Artificial Intelligence (ECAI), pages341–345, 1998.

20. H. H. Do, S. Melnik, and E. Rahm. Comparison of schema matching evaluations. In Pro-ceedings of the workshop on Web and Databases, 2002.

21. H. H. Do and E. Rahm. COMA - a system for flexible combination of schema matchingapproaches. In Proceedings of the Very Large Data Bases Conference (VLDB), pages 610–621, 2001.

22. A. Doan, P. Domingos, and A. Halevy. Reconciling schemas of disparate data sources: Amachine-learning approach. In Proceedings of the International Conference on Managementof Data (SIGMOD), pages 509–520, 2001.

23. A. Doan, J. Madhavan, P. Domingos, and A. Halevy. Learning to map ontologies on thesemantic web. In Proceedings of the International World Wide Web Conference (WWW),pages 662–673, 2003.

24. D. Dou, D. McDermott, and P. Qi. Ontology translation on the Semantic Web. Journal onData Semantics (JoDS), II:35–57, 2005.

25. M. Ehrig and S. Staab. QOM: Quick ontology mapping. In Proceedings of the InternationalSemantic Web Conference (ISWC), pages 683–697, 2004.

26. M. Ehrig and Y. Sure. Ontology mapping - an integrated approach. In Proceedings of theEuropean Semantic Web Symposium (ESWS), pages 76–91, 2004.

27. J. Euzenat. An API for ontology alignment. In Proceedings of the International SemanticWeb Conference (ISWC), pages 698–712, 2004.

28. J. Euzenat, J. Barrasa, P. Bouquet, R. Dieng, M. Ehrig, M. Hauswirth, M. Jarrar, R. Lara,D. Maynard, A. Napoli, G. Stamou, H. Stuckenschmidt, P. Shvaiko, S. Tessaris, S. van Acker,I. Zaihrayeu, and T. L. Bach. D2.2.3: State of the art on ontology alignment. Technical report,NoE Knowledge Web project delivable, 2004. http://knowledgeweb.semanticweb.org/.

29. J. Euzenat and P. Valtchev. An integrative proximity measure for ontology alignment. InProceedings of the Semantic Integration workshop at the International Semantic Web Con-ference (ISWC), 2003.

30. J. Euzenat and P. Valtchev. Similarity-based ontology alignment in OWL-lite. In Proceedingsof the European Conference on Artificial Intelligence (ECAI), pages 333–337, 2004.

31. A. Gangemi, N. Guarino, C. Masolo, and A. Oltramari. Sweetening WordNet with DOLCE.AI Magazine, (24(3)):13–24, 2003.

32. F. Giunchiglia and P. Shvaiko. Semantic matching. The Knowledge Engineering ReviewJournal (KER), (18(3)):265–280, 2003.

33. F. Giunchiglia, P. Shvaiko, and M. Yatskevich. S-Match: an algorithm and an implementationof semantic matching. In Proceedings of the European Semantic Web Symposium (ESWS),pages 61–75, 2004.

34. F. Giunchiglia, P. Shvaiko, and M. Yatskevich. Semantic schema matching. Technical ReportDIT-05-014, University of Trento, 2005.

35. F. Giunchiglia and M. Yatskevich. Element level semantic matching. In Proceedings ofMeaning Coordination and Negotiation workshop at the International Semantic Web Con-ference (ISWC), 2004.

36. F. Giunchiglia and I. Zaihrayeu. Making peer databases interact - a vision for an architecturesupporting data coordination. In Proceedings of the International workshop on CooperativeInformation Agents (CIA), pages 18–35, 2002.

37. N. Guarino. The role of ontologies for the Semantic Web (and beyond). Technical report,Laboratory for Applied Ontology, Institute for Cognitive Sciences and Technology (ISTC-CNR), 2004.

38. B. He and K. C.-C. Chang. A holistic paradigm for large scale schema matching. SIGMODRecord, 33(4):20–25, 2004.

39. Y. Kalfoglou and M. Schorlemmer. Ontology mapping: the state of the art. The KnowledgeEngineering Review Journal (KER), (18(1)):1–31, 2003.

40. J. Kang and J. F. Naughton. On schema matching with opaque column names and datavalues. In Proceedings of the International Conference on Management of Data (SIGMOD),pages 205–216, 2003.

41. V. Kashyap and A. Sheth. Semantic and schematic similarities between database objects:a context-based approach. The International Journal on Very Large Data Bases (VLDB),5(4):276–304, 1996.

42. J. A. Larson, S. B. Navathe, and R. Elmasri. A theory of attributed equivalence in databaseswith application to schema integration. IEEE Transactions on Software Engineering,15(4):449–463, 1989.

43. M. Lenzerini. Data integration: A theoretical perspective. In Proceedings of the Symposiumon Principles of Database Systems (PODS), pages 233–246, 2002.

44. W. S. Li and C. Clifton. Semantic integration in heterogeneous databases using neural net-works. In Proceedings of the Very Large Data Bases Conference (VLDB), pages 1–12, 1994.

45. J. Madhavan, P. Bernstein, A. Doan, and A. Halevy. Corpus-based schema matching. InProceedings of the International Conference on Data Engineering (ICDE), pages 57–68,2005.

46. J. Madhavan, P. Bernstein, and E. Rahm. Generic schema matching with Cupid. In Proceed-ings of the Very Large Data Bases Conference (VLDB), pages 49–58, 2001.

47. A. Maedche, B. Motik, N. Silva, and R. Volz. MAFRA - A MApping FRAmework forDistributed Ontologies. In Proceedings of the International Conference on Knowledge En-gineering and Knowledge Management (EKAW), pages 235–250, 2002.

48. A. Maedche and S. Staab. Measuring similarity between ontologies. In Proceedings of theInternational Conference on Knowledge Engineering and Knowledge Management (EKAW),pages 251–263, 2002.

49. D. L. McGuinness, R. Fikes, J. Rice, and S. Wilder. An environment for merging and test-ing large ontologies. In Proceedings of the International Conference on the Principles ofKnowledge Representation and Reasoning (KR), pages 483–493, 2000.

50. S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity flooding: A versatile graph matchingalgorithm. In Proceedings of the International Conference on Data Engineering (ICDE),pages 117–128, 2002.

51. S. Melnik, E. Rahm, and P. Bernstein. Rondo: A programming platform for generic modelmanagement. In Proceedings of the International Conference on Management of Data (SIG-MOD), pages 193–204, 2003.

52. E. Mena, V. Kashyap, A. Sheth, and A. Illarramendi. OBSERVER: An approach for queryprocessing in global information systems based on interoperability between pre-existing on-tologies. In Proceedings of the International Conference on Cooperative Information Sys-tems (CoopIS), pages 14–25, 1996.

53. A. G. Miller. WordNet: A lexical database for English. Communications of the ACM,(38(11)):39–41, 1995.

54. P. Mitra, N. Noy, and A. Jaiswal. OMEN: A probabilistic ontology mapping tool. In Proceed-ings of the Meaning Coordination and Negotiation workshop at the International SemanticWeb Conference (ISWC), 2004.

55. I. Niles and A. Pease. Towards a standard upper ontology. In Proceedings of the InternationalConference on Formal Ontology in Information Systems (FOIS), pages 2–9, 2001.

56. N. Noy and M. Klein. Ontology evolution: Not the same as schema evolution. Knowledgeand Information Systems, 2002.

57. N. Noy and M. Musen. Anchor-PROMPT: using non-local context for semantic matching.In Proceedings of the workshop on Ontologies and Information Sharing at the InternationalJoint Conference on Artificial Intelligence (IJCAI), pages 63–70, 2001.

58. L. Palopoli, G. Terracina, and D. Ursino. The system DIKE: Towards the semi-automaticsynthesis of cooperative information systems and data warehouses. In ADBIS-DASFAA,Matfyzpress, pages 108–117, 2000.

59. M. Paolucci, T. Kawmura, T. Payne, and K. Sycara. Semantic matching of web servicescapabilities. In Proceedings of the International Semantic Web Conference (ISWC), pages333–347, 2002.

60. C. Parent and S. Spaccapietra. Database integration: the key to data interoperability. InM. P. Papazoglou, S. Spaccapietra, and Z. Tari, editors, Advances in Object-Oriented DataModeling. The MIT Press, 2000.

61. R. Rada, H. Mili, E. Bicknell, and M. Blettner. Development and application of a metric onsemantic nets. IEEE Transactions on Systems, Man and Cybernetics, (19(1)):17–30, 1989.

62. E. Rahm and P. Bernstein. A survey of approaches to automatic schema matching. TheInternational Journal on Very Large Data Bases (VLDB), (10(4)):334–350, 2001.

63. E. Rahm, H. H. Do, and S. Maßmann. Matching large XML schemas. SIGMOD Record,33(4):26–31, 2004.

64. P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. InProceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages448–453, 1995.

65. D. Shasha, J. T. L. Wang, and R. Giugno. Algorithmics and applications of tree and graphsearching. In Proceedings of the Symposium on Principles of Database Systems (PODS),pages 39–52, 2002.

66. A. Sheth and J. Larson. Federated database systems for managing distributed, heterogeneous,and autonomous databases. ACM Computing Surveys, 22(3):183–236, 1990.

67. P. Shvaiko. A classification of schema-based matching approaches. In Proceedings of theMeaning Coordination and Negotiation workshop at the International Semantic Web Con-ference (ISWC), 2004.

68. A. Sotnykova, C. Vangenot, N. Cullot, N. Bennacer, and M.-A. Aufaure. Semantic mappingsin description logics for spatio-temporal database schema integration. Journal on Data Se-mantics (JoDS), Special Issue on Semantic-based Geographical Information Systems, III,2005.

69. Y. Sure, O. Corcho, J. Euzenat, and T. Hughes. Evaluation of Ontology-based Tools. Pro-ceedings of the International Workshop on Evaluation of Ontology-based Tools (EON),2004. http://CEUR-WS.org/Vol-128/.

70. M. Uschold and M. Gruninger. Ontologies and semantics for seamless connectivity. SIG-MOD Record, 33(4):58–64, 2004.

71. P. Valtchev. Construction automatique de taxonomies pour l’aide a la representation deconnaissances par objets. These d’informatique, Universite Grenoble 1, 1999.

72. P. Valtchev and J. Euzenat. Dissimilarity measure for collections of objects and values.Lecture Notes in Computer Science, 1280:259–272, 1997.

73. R. van Eijk, F. de Boer, W. van de Hoek, and J. J. Meyer. On dynamically generated ontologytranslators in agent communication. International Journal of Intelligent System, 16:587–607,2001.

74. Y. Velegrakis, R. J. Miller, and J. Mylopoulos. Representing and querying data transforma-tions. In Proceedings of the International Conference on Data Engineering (ICDE), pages81–92, 2005.

75. H. Wache, T. Voegele, U. Visser, H. Stuckenschmidt, G. Schuster, H. Neumann, and S. Hueb-ner. Ontology-based integration of information - a survey of existing approaches. In Pro-ceedings of the workshop on Ontologies and Information Sharing at the International JointConference on Artificial Intelligence (IJCAI), pages 108–117, 2001.

76. L. Xu and D. W. Embley. Using domain ontologies to discover direct and indirect matches forschema elements. In Proceedings of the Semantic Integration workshop at the InternationalSemantic Web Conference (ISWC), 2003.

77. K. Zhang and D. Shasha. Approximate tree pattern matching. In A. Apostolico and Z. Galil,editors, Pattern matching in strings, trees, and arrays, pages 341–371. Oxford University,1997.

Documents

A Survey of Schema-based Matching Approachesdisi.unitn.it/~p2p/RelatedWork/Matching/JoDS-IV... · A Survey of Schema-based Matching Approaches Pavel Shvaiko1 and J´erˆome Euzenat2