

Information Systems 36 (2011) 134–150

0306-4379/$ - see front matter © 2010 Elsevier B.V. All rights reserved.
doi:10.1016/j.is.2010.09.003

* Corresponding author.
E-mail addresses: [email protected] (M. Palmonari), [email protected] (A. Sala), [email protected] (A. Maurino), [email protected] (F. Guerra), [email protected] (G. Pasi), [email protected] (G. Frisoni).

Aggregated search of data and services

Matteo Palmonari b,*, Antonio Sala a, Andrea Maurino b, Francesco Guerra a, Gabriella Pasi b, Giuseppe Frisoni b

a Università di Modena e Reggio Emilia, Via Vignolese 905, 41125 Modena, Italy
b Università di Milano Bicocca, Viale Sarca 336, 20126 Milano, Italy

Article info

Keywords: Aggregated search; Data integration; Semantic Web Service discovery; Information retrieval; Web Services


Abstract

From a user perspective, data and services provide a complementary view of an information source: data provide detailed information about specific needs, while services execute processes involving data and returning an informative result as well. For this reason, users need to perform aggregated searches to identify not only relevant data, but also services able to operate on them. At the current state of the art such aggregated search can be only manually performed by expert users, who first identify relevant data, and then identify existing relevant services.

In this paper we propose a semantic approach to perform aggregated search of data and services. In particular, we define a technique that, on the basis of an ontological representation of both data and services related to a domain, supports the translation of a data query into a service discovery process.

In order to evaluate our approach, we developed a prototype that combines a data integration system with a novel information retrieval-based Web Service discovery engine (XIRE). The results produced by a wide set of experiments show the effectiveness of our approach with respect to IR approaches, especially when Web Service descriptions are expressed by means of a heterogeneous terminology.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

In the knowledge society, users need both to access data and to invoke services offered on the Internet. Data search and service invocation can be individually carried out, and both these operations provide value to users in the context of complex interactions. However, users typically need to perform aggregated searches able to identify not only relevant data, but also services able to operate on them. At the state of the art, such aggregated search can be performed only by expert users, who first identify relevant data, and then locate existing relevant services. Search for data on a specific domain can be eased by data integration systems [1] able to federate heterogeneous data sources by offering users a mediated schema that can be exploited by means of SQL-like queries.

The advent of Service Computing and Semantic Web Service technologies, instead, provides specific techniques for service discovery. These techniques are often very specific and require considerable effort to learn and use. The research on data integration and semantic service discovery has involved from the beginning different (not always overlapping) communities. As a consequence, data and services are described with different models, and different techniques to retrieve data and services have been developed. Nevertheless, from a user perspective, the border between data and services is often not so definite, since data and services provide a complementary view of the available resources: data


provide detailed information about specific needs, while services execute processes involving data and returning an informative result.

Users need new techniques to effectively access data and services in a unified manner: both the richness of the information available on the Web and the difficulties the user faces in gathering such information (as a service result, as a query on a form, as a query on a data source containing information extracted from web sites) make a tool for jointly searching data and related services, with the same language, really necessary. Aggregated search is a young research field, and very few approaches have been proposed. Search computing [2] is a novel discipline the goal of which is to answer complex, multi-domain queries. This research activity supports a new generation of purely data-oriented user queries such as: "Where can I attend a scientific database conference in a city within a hot region served by luxury hotels and reachable with cheap flights?" A more ambitious goal is to enrich the above-mentioned query evaluation by offering not only the list of scientific DB conferences with the related locations, hotel names and flight information, but also the list of services able to book hotels and flights. In [3], the authors consider as aggregated search the task of searching and assembling information from a variety of sources, presenting it in a single interface.

In this paper we propose a new interpretation of aggregated search, i.e. the integration of distinct search modalities, over distinct sources and distinct search spaces, obtained by a single query over a unified representation of data and services and producing aggregated results. The most important outcomes of the research reported in this paper are: (i) techniques for integrating data and service knowledge; (ii) techniques for handling structural heterogeneity (queries are expressed in a language different from service description languages) and terminological heterogeneity (queries are expressed by syntactically and semantically different terms w.r.t. the ones adopted in service descriptions); and (iii) a new ontology modularization algorithm. A framework architecture comprising the data ontology (DO) manager, which supports the common knowledge extracted from heterogeneous sources, and XIRE, an information retrieval-based Web Service engine able to provide a list of ranked services according to a set of weighted keywords, was designed and developed to evaluate the effectiveness of the overall approach. Evaluations based on state-of-the-art benchmarks for semantic Web Service discovery show that our information retrieval-based approach provides good results in terms of recall and precision.

The outline of the paper is the following: Section 2 introduces a motivating scenario that will be adopted as a running example. Section 3 describes the approach developed for publishing unified contents of data and services. Section 4 provides the description of the technique for aggregating data and services. In Section 5 the framework is discussed, and in Section 6 the evaluation is presented. Related works are discussed in Section 7; finally, conclusions and future work are drawn in Section 8.

2. Motivating scenario and running example

Let us assume that a virtual touristic district is composed of a set of touristic companies (including travel agencies, hotels, local public administrations, touristic promotion agencies) creating a semantic peer in which they want to share and expose an integrated view of touristic data and services. The semantic peer wants to supply the tourist promoters and travelers with all the information about a location by means of only one tool managing both data and services provided by different web sources.

Let us introduce as an example three information systems about Italian locations that may be integrated to create a larger information source available for touristic purposes:

• BookAtMe provides information about more than 30,000 hotels located in more than 8000 destinations. For each hotel, information is provided about facilities, prices, policies, etc. Some services are also available for checking the availability of a room and booking it.

• Touring provides information about Italian hotels, restaurants and cities. By means of three different forms, it is possible to find available accommodations, restaurants (described with some specific features) and main monuments for each city.

• TicketOne provides information about Italian cultural events. For each event, a description including place, price and details is provided. The information system offers services to check the ticket availability of a particular event and, also, to buy tickets.

A global view of the data sources provided by each of these information systems is created by means of a data integration system and shown in Fig. 1. To give an idea of the global view obtained, there are accommodations and restaurants in certain locations. A restaurant can be associated with a certain accommodation, like a hotel restaurant. Every accommodation has certain facilities, where some kinds of activities can be performed. Events take place in a location, and vacation packages are available at special rates for some accommodations to attend special events. Information about customers and about bookings made by customers for certain accommodations is also represented. Moreover, cars are available for rental in certain locations.

Now let us consider a query Q1 to find the name of accommodations available in Modena.

select Name, City, Country
from Accommodation
where Accommodation.City = 'Modena'

While the problem of finding relevant data for such a query is well defined, an important problem is to retrieve, among the many services available and related to the several domains mentioned above, the ones that are possibly related to Q1, according to the semantics of the terms involved in the query.

Fig. 1. The tourist integrated schema. The global view comprises the following classes:

- Accommodation(ID: int, Name: string, Address: string, City: string, Zip_Code: int, Country: string, Telephone: int, Web_Site: string)
- Activity(ID: int, Accommodation_ID: int, Zip_code: int, Organization: string, Type: string, Description: string, Note: string)
- Booking(Customer_ID: int, Accommodation_ID: int)
- Facility(ID: int, Accommodation_ID: int, Facility_Name: string, Activity_ID: int, Description: string, Note: string)
- Car(License: int, Make: string, Model: string, Color: string, Zip_code: int)
- Location(Zip_Code: int, Name: string, Country: string, Region: string, Mayor: string, Population: int, Area: int, Main_Sights: string)
- Customer(ID: int, Name: string, Address: string, Telephone: string, Zip_Code: int, Email: string)
- Event(ID: int, Event_Name: string, Date: date, Description: string, Zip_Code: int)
- Restaurant(ID: int, Restaurant_Name: string, Accommodation_ID: string, Zip_Code: int, Address: string, Telephone: int, Web_Site: string)
- Vacation_Package(ID: int, Accommodation_ID: string, Main_Activity_ID: int, Zip_Code: int, Address: string, Telephone: int, Web_Site: string)


3. Building the global data and service view

In order to perform aggregated search there is the need of a Unique Virtual View (UVV), i.e. a unified representation of data and services. A UVV is made up of the following components:

• a data ontology (DO), i.e. a common representation of all the data sources belonging to the peer; the DO is built by means of the MOMIS (Mediator envirOnment for Multiple Information Sources) data integration system, as described in Section 3.1;

• a global light service ontology (GLSO) that provides, by means of a set of concepts and attributes, a global view of all the concepts and attributes used for the description of the Web Services available in the peer;

• a set of mappings which connect GLSO elements to DO elements.

1 http://wordnet.princeton.edu/
2 http://www.odmg.org/

3.1. Building an integrated representation of data: the data ontology

The data ontology (DO) is a virtual integrated view of the schemas of the underlying data sources. It is obtained by exploiting the lexical and structural knowledge of the local source schemas to semi-automatically group together similar or semantically related elements appearing in different information sources. The MOMIS system handles this process by defining a common thesaurus (CT) composed of inter- and intra-schema relationships among classes and attributes appearing in the local sources, automatically created by extracting different kinds of relationships from the sources: structural relationships are directly derived from the source schemas (e.g. a foreign key in a relational database); lexical relationships are obtained from the relationships existing in the WordNet1 database between the meanings associated to the source elements in a semi-automatic semantic annotation phase; other relationships can be inferred by means of description logics techniques.

MOMIS represents both the data source schemas and the integrated global schema in a common data model, called ODLI3, which is an extension of the ODL language, an object-oriented language developed by ODMG.2 ODLI3 allows different kinds of data sources to be represented as a set of classes and attributes and is transparently translated into a description logic [4,1]. ODLI3 allows the data source schemas and the integrated view to be formally represented, as well as lexical knowledge about them: usual thesaurus relationships such as synonymy, broader or narrower term, or related term relationships can be expressed. ODLI3 also allows Mapping Rules to be represented, which express relationships between the integrated view of the information sources and the schema description of the original sources.

Using ODLI3 for representing sources and ontologies is not a limitation: the interoperability of the ODLI3 descriptions is guaranteed by specific wrapper modules able to translate, without loss of semantics for the integration purposes, such descriptions into the languages for describing sources and ontologies on the web, i.e. OWL, RDF, XML (Schema).3

The MOMIS integration process for building the DO is composed of five phases:

1. Local source schemata extraction. Wrappers analyze sources in order to extract (or generate, if the source is not structured) schemas. Such schemas are then translated into the common language ODLI3.

2. Local source annotation with WordNet. The system automatically suggests a meaning, i.e. a synset taken from the WordNet lexical ontology, for each element of a local source schema. Some techniques are implemented in MOMIS for achieving this goal [5]. A GUI supports the integration designer in reviewing and, if necessary, correcting the proposed annotations. MOMIS also allows the user to extend the WordNet ontology by adding new concepts and relating them to the native elements of WordNet.

3. Common thesaurus generation. Starting from the annotated local schemas, MOMIS constructs a set of thesaurus relationships describing inter- and intra-schema knowledge about classes and attributes of the source schemas. The CT is incrementally built starting from schema-derived relationships, i.e. automatic extraction of intra-schema relationships from each schema separately. Then, the relationships existing in the WordNet database between the annotated meanings are exploited by generating relationships between the respective elements, called lexicon-derived relationships. The integration designer may add new relationships to capture specific domain knowledge; finally, by means of a description logics reasoner called ODB-Tools, which performs equivalence and subsumption computation, the MOMIS system infers new relationships and computes the transitive closure of CT relationships.

4. DO generation. This methodology generates an affinity matrix with similarity measures between the elements of the sources based on the relationships contained in the CT. A hierarchical clustering technique applied to this affinity matrix groups similar elements of different sources, exploited for generating a global schema and sets of mappings with local schemas. The DO is made up of a set of global classes; several global attributes belong to each global class.

5. DO annotation. By exploiting the annotated local schemas and the mappings between local and global schemas, the MOMIS system semi-automatically assigns a name and a meaning to each element of the global schema.
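Phase 4 above (DO generation) can be sketched in a few lines. The following is a minimal, illustrative sketch, not the MOMIS implementation: the element names and affinity values are invented, and a simple greedy single-linkage clustering stands in for the hierarchical clustering technique actually used.

```python
# Illustrative sketch of DO generation (phase 4): cluster source elements
# whose pairwise affinity (derived from CT relationships) exceeds a threshold.
# Element names and affinity values below are invented for illustration.

def cluster_elements(elements, affinity, threshold=0.7):
    """Greedy single-linkage clustering: merge two clusters whenever any
    cross-pair affinity is above the threshold."""
    clusters = [{e} for e in elements]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(affinity.get(frozenset((a, b)), 0.0) > threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i] |= clusters[j]
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters

elements = ["L1.hotel", "L2.lodge", "L3.accommodation", "L1.restaurant"]
affinity = {
    frozenset(("L1.hotel", "L2.lodge")): 0.9,           # NT via WordNet hyponymy
    frozenset(("L1.hotel", "L3.accommodation")): 0.8,   # designer-provided NT
    frozenset(("L2.lodge", "L3.accommodation")): 0.75,  # inferred NT
    frozenset(("L1.hotel", "L1.restaurant")): 0.2,      # weak relation
}

clusters = cluster_elements(elements, affinity)
# the three accommodation-like classes form one global-class candidate
```

Each resulting cluster corresponds to a candidate global class; in MOMIS the cluster of hotel-like classes yields the global class Accommodation of Example 3.1.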

The result is thus an ontology made of a set of global classes, each composed of global attributes, that is a representation of the knowledge provided by the available data sources.

3 http://www.w3.org/

Example 3.1. Let us consider three local sources: L1.HOTEL(ID, NAME, TELEPHONE, ADDRESS, CITY, ZIP, COUNTRY, WEB-SITE), L2.LODGE(ID, DENOMINATION, TEL, ADDRESS, CITY, ZIPCODE, WWW) and L3.ACCOMMODATION(ID, HOTEL_NAME, PHONE, ADDRESS, ZIP_CODE). These three local classes are represented in ODLI3 as:

Source L1

interface hotel() {
  attribute string id;
  attribute string Name;
  attribute string Telephone;
  attribute string Address;
  attribute string City;
  attribute string Zip;
  attribute string Country;
  attribute string Web-site;
}

Source L2

interface lodge() {
  attribute integer id;
  attribute string denomination;
  attribute string tel;
  attribute string address;
  attribute string city;
  attribute string zipcode;
  attribute string www;
}

Source L3

interface accommodation() {
  attribute integer id;
  attribute string Hotel_Name;
  attribute string phone;
  attribute string address;
  attribute string zip_code;
}

In the annotation phase, the element L1.hotel is annotated with the WordNet synset "a building where travelers can pay for lodging and meals and other services", while the element L2.lodge is annotated with the synset "a hotel providing overnight lodging for travelers". Since lodge is a hyponym of hotel in WordNet, in the CT a NT (narrower term) relationship is generated between lodge and hotel. We defined L1.hotel as a hyponym of L3.accommodation, thus leading to an NT relationship between these two elements in the CT. Another NT relationship between lodge and accommodation can thus be inferred. Being very similar, these three local classes are grouped together by the clustering algorithm, generating the global class Accommodation. This global class is composed of the union of the attributes coming from the three local sources it is mapped on. The global class is then represented in ODLI3 as:

interface Accommodation {
  attribute id integer,
  attribute Name string,
  attribute Address string,
  attribute City integer,
  attribute Zip_Code integer,
  attribute Country integer,
  attribute Telephone integer,
  attribute Web_Site string,
}
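The CT inference illustrated by Example 3.1 (an NT relationship between lodge and accommodation obtained by chaining the two asserted ones) amounts, for this fragment, to a transitive closure over NT pairs. Here is a minimal sketch, assuming pairs are (narrower, broader) tuples; the real system delegates this and richer inferences to the ODB-Tools reasoner.

```python
# Toy sketch of CT inference from Example 3.1: compute the transitive
# closure of NT (narrower-term) relationships. The full system (ODB-Tools)
# also performs equivalence and subsumption computation; only the closure
# is reproduced here.

def transitive_closure(pairs):
    """Repeatedly chain (a, b) and (b, d) into (a, d) until a fixpoint."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

nt = {
    ("L2.lodge", "L1.hotel"),          # from WordNet hyponymy
    ("L1.hotel", "L3.accommodation"),  # designer-provided annotation
}
ct = transitive_closure(nt)
# ("L2.lodge", "L3.accommodation") is now an inferred NT relationship
```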

Fig. 2. A sketch of the overall process for building the GLSO.


Finally, the global class is automatically annotated starting from the annotations provided to the local classes that compose it and by considering the mappings between the local and the global schemas.

3.2. GLSO construction

The global light service ontology is built by means of a two-step procedure: (i) service indexing, and (ii) global service ontology (GSO) construction. A sketch of the overall process is given in Fig. 2.

3.2.1. Service indexing

In order to retrieve services, an information retrieval approach is applied to the semantic descriptions of Web Services (WS). In this paper we consider OWL-S4 as the semantic Web Service description language, but the approach can be easily generalized to other lighter semantic annotation languages for Web Services such as SAWSDL5; in both languages, the semantic descriptions of services refer to OWL domain service ontologies (SOs). The IR approach (aimed at locating relevant services related to the SQL query) requires a formal representation of the service descriptions stored in the repository, and it is based on full-text indexing which extracts terms from six specific sections of a service description: service name, service description, input, output, pre-condition and effects. As an example, the service "City_Hotel_Service" from the OWLS-TC collection imports two domain ontologies (Travel and Portal) and is described by: the name "CityHotelInfoService", the text description "This service returns information of a hotel of a given city", and the ontology concepts City (from the Portal ontology) and Hotel (from the Travel ontology) as input and output, respectively. While the service name and description consist of short text sections, both input and output refer to domain SOs, namely, a Portal and a Travel ontology (pre-conditions and effects are represented analogously, but they are seldom used and are missing in the example).

4 http://www.w3.org/Submission/OWL-S/
5 http://www.w3.org/2002/ws/sawsdl/

Given a collection of OWL-S service descriptions, the indexing process extracts a set of index terms I which contains two subsets: the set I_O ⊆ I of ontological index terms, i.e. terms taken from ontological resources (e.g. City and Hotel in the example), and the set I_T ⊆ I of textual index terms, i.e. terms extracted from the textual descriptions (e.g. "information", "hotel", "city"). Observe that the sets I_O and I_T are disjoint by construction because identifiers of ontological index terms are URIs. The indexing structure is based on a "structured document" approach, where the structure of the document consists of the six aforementioned sections. The inverted file structure consists of (i) a dictionary file based on I, and (ii) a posting file, with a list of references (one for each term in the dictionary) to the services' sections where the considered term occurs. The posting file is organized to store, for each index term i ∈ I, a list of blocks containing (i) the service identifier, and (ii) the identifiers of the service's sections in which the term appears. Each block has a variable length, as a term appears in at least one section of a service in the posting list, but it may appear in more than one section. In the usual IR approach to text indexing an index term weight is computed, which quantifies the informative role of the term in the considered document. As descriptions are usually quite short texts, we do not compute index-term weights. Instead, we propose an approach to section weighting, in order to enhance the informative role of term occurrences in the distinct sections of the service. These importance weights will be used in the query evaluation phase in order to rank the services retrieved in response to a user's query, as explained in Section 4.
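The section-structured inverted file described above can be sketched as follows. The service data mirrors the CityHotelInfoService example; the URIs and the exact posting layout are illustrative assumptions, not the XIRE implementation.

```python
# Sketch of the section-structured inverted file: the dictionary maps each
# index term to postings of (service id -> set of sections where it occurs),
# so a posting block has variable length, one entry per section occurrence.
# Terms written as URIs stand for ontological index terms (I_O); plain words
# are textual index terms (I_T). All data is illustrative.

SECTIONS = ("name", "description", "input", "output", "precondition", "effect")

def index_services(services):
    postings = {}  # term -> {service_id -> set(sections)}
    for sid, desc in services.items():
        for section in SECTIONS:
            for term in desc.get(section, []):
                postings.setdefault(term, {}).setdefault(sid, set()).add(section)
    return postings

services = {
    "CityHotelInfoService": {
        "name": ["city", "hotel", "info", "service"],
        "description": ["service", "returns", "information", "hotel", "city"],
        "input": ["http://example.org/portal#City"],    # ontological term (I_O)
        "output": ["http://example.org/travel#Hotel"],  # ontological term (I_O)
    },
}

idx = index_services(services)
# "hotel" occurs in two sections of the same service: a variable-length block
```

At query time, the importance weight attached to each section (rather than a per-term weight) would be looked up for every section in a posting block to score the service.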

Fig. 3. A sketch of the neighborhood identified by the PaNCH algorithm on a sample ontology.


3.2.2. GSO construction

To construct the global service ontology (GSO), the procedure first juxtaposes each service ontology (SO) O such that there exists some i ∈ I_O with i ∈ O (e.g. the Portal and Travel ontologies in the reference example); ontologies can import other ontologies (e.g. the Portal ontology imports the Mid-level ontology, which imports the SUMO ontology), and every ontology contained in the transitive closure of the import clause is included in the GSO.

By juxtapose we mean that SOs are merged by asserting that their top concepts (e.g. PortalEntity) are all subclasses of Thing, without attempting to integrate similar concepts across the different integrated ontologies. Therefore, if the SOs are consistent, the GSO can be assumed to be consistent, because no axioms establishing relationships among the concepts of the SOs are introduced (e.g. axioms referring to the same intended concept would actually refer to two distinct concepts with two different URIs).

Finally, top concepts of the GSO that belong to the same ontology O are grouped together by defining for each O a superconcept of all the top concepts of O (e.g. the concept PortalEntity for the Portal ontology).
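The juxtaposition step can be sketched with a toy concept hierarchy; the ontology names and concept identifiers are illustrative, and a plain (child, parent) edge set stands in for the OWL subclass axioms.

```python
# Sketch of GSO construction by juxtaposition: top concepts of each service
# ontology become subclasses of a per-ontology superconcept (e.g.
# PortalEntity), which in turn is a subclass of Thing; no cross-ontology
# axioms are added. Concept names are illustrative; in practice URIs keep
# same-named concepts from different SOs distinct.

def juxtapose(ontologies):
    """ontologies: {name: {concept: parent_or_None}} -> set of subclass edges."""
    edges = set()
    for name, hierarchy in ontologies.items():
        group = f"{name}Entity"          # per-ontology superconcept
        edges.add((group, "Thing"))
        for concept, parent in hierarchy.items():
            # top concepts (parent None) attach to the group concept
            edges.add((concept, parent if parent else group))
    return edges

ontologies = {
    "Portal": {"portal#City": None, "portal#Capital": "portal#City"},
    "Travel": {"travel#Hotel": None},
}
gso = juxtapose(ontologies)
# no edge relates a Portal concept to a Travel concept: consistency of the
# SOs is preserved in the merged hierarchy
```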

3.2.3. Construction of the GLSO

The GSO may turn out to be extremely large, which makes the semi-automatic process of mapping the GSO to the DO more expensive; moreover, only a subset of the terms of the ontologies (including the imported ones) related to the SWS descriptions are actually relevant. To solve this problem a GLSO (global light service ontology) is extracted from the GSO, reducing the service ontology size while preserving the ontology structure with respect to the set of concepts most relevant to the service descriptions.

To extract the GLSO we provide a new ontology module extraction algorithm, named PaNCH (Parametric Neighborhoods based on Concept Hierarchies), which exploits a traversal-based approach [6]. The algorithm, which takes as input a set of ontology concepts, aims to preserve structural similarity between concepts [7] and to be highly configurable with respect to the size of the extracted module.

The pseudocode of PaNCH is listed in Algorithm 1. The rationale of the algorithm consists in defining a neighborhood of the input concepts, namely the set of index terms I_O, as a subtree of the GSO's concept hierarchy. Consider the sample ontology sketched in Fig. 3, and assume concepts {i1, i2} as input of the algorithm. The algorithm starts from the subtree of the GSO's concept hierarchy bottom-bounded by the input concepts (e.g. nodes {i1, i2, rt1, rt2} in the figure), and extends it with the siblings of the nodes in the subtree (e.g. nodes {rs1, rs2, rs3}); then, a neighborhood of such subtree is built by traversing downwards the subclass relation through a path of length k from the input terms (e.g. k=2 in the figure), and of length h from the siblings (e.g. h=1 in the figure); properties related to the concepts in the neighborhood by domain and range restrictions are also included in the neighborhood (properties are not represented in the figure for the sake of clarity). By removing all the concepts and properties outside the neighborhood and their related axioms, the algorithm returns an ontology which defines a module of the GSO, namely the GLSO. More in detail, the algorithm consists of the following steps.

Algorithm 1. Extracting the GLSO as a module of the GSO

Input:
  O: a given ontology (the GSO)
  T: a set of concepts (the set I_O)
  k: length of the downward traversal of the subclass hierarchy from the ontological index terms
  h: length of the downward traversal of the subclass hierarchy from the sibling concepts
Output:
  the ontology defining the extracted module (the GLSO)

PaNCH(O, T, k, h)
(1)  RT ← ∅; RS ← ∅; RP ← ∅; CN ← ∅
(2)  NCH ← inferredConceptHierarchy(O)
(3)  RT ← T
(4)  foreach x ∈ T
(5)      RT ← RT + ancestorNCH(x)
(6)  foreach x, y : x ∈ RT, y ∉ RT, and sibling(x, y)
(7)      RS ← RS + y
(8)  CN ← RT + RS
(9)  foreach x : x ∈ T
(10)     CN ← CN + descendantNCH(x, k)
(11) foreach x : x ∈ RS
(12)     CN ← CN + descendantNCH(x, h)
(13) foreach p : ∃x such that x ∈ IncludedDomains_p and x ∈ CN
(14)     RP ← RP + p
(15) foreach c ∈ O and c ∉ CN
(16)     delete(c, O) (and all the related axioms in O)
(17) foreach p ∈ O and p ∉ RP
(18)     delete(p, O) (and all the related axioms)
(19) GLSO ← O

Step 1 in Algorithm 1. The set of support datastructures used by the algorithm are initialized. � Step 2 in Algorithm 1. The Named Concept Hierarchy

(NCH) is a tree defined as a representation of theinferred class hierarchy of the ontology O; the inferredclass hierarchy is computed by a reasoner. Nodes of theNCH tree represent concepts of O, and arcs represent

M. Palmonari et al. / Information Systems 36 (2011) 134–150140

subclass relations between the concepts. Observe thatbecause of the multiple inheritance, a concept may berepresented by different nodes of the NCH. The treedepicted in Fig. 3 show the NCH of a sample ontology.

� Steps 3–5. The algorithm identifies a subtree of NCH,

which represents the subset of the concept hierarchymore related to the input concepts T, namely, thestrictly Related Tree (RT). RT includes each node ofNHT representing an input concept i 2 T, and everyancestors of i in the NCH.

� Steps 6 and 7. The algorithm identifies a set of related

siblings (RS); RS represents a set of nodes of NCH thatare sibling of the nodes in RT but are not in RTthemselves.

� Steps 8–12. The Neighborhood Concepts tree NC repre-

sents a neighborhood of T; NC is built as subtree ofthe NHC as follows: the nodes RS are appended toRT (step 8); for each input concept i 2 T, k-descendantsin NHC of i are appended to NC (this is representedby the function descendantNCHðx,yÞ in Algorithm1—steps 9 and 10); for each related sibling s 2 RS,the h-descendants in NHC of i are appended to NC(steps 11 and 12).

• Steps 13 and 14. The Related Property (RP) set, which preserves in the extracted ontology module the properties of the concepts represented in CN, is built. Properties considered relevant to the mapping between the DO and the GLSO are those whose domain, when different from Thing, subsumes at least one concept represented in CN; in particular, the concepts evaluated as relevant for a property p (identified by the set IncludedDomains_p in Algorithm 1) are all the concepts C such that C is the domain of p (RDFS domain restriction), or C occurs in an axiom of the form C ⊑ ∃p.D or C ⊑ ≥1 p.D for some arbitrary concept D.

• Steps 15–18. These steps delete from the ontology O the elements (concepts and properties) that are not included in the module, together with all their related axioms. In particular, nodes of the NCH that are not included in the neighborhood CN are deleted; when a node is deleted from the NCH, if there is any other node representing the same concept in another branch, only the related subclass relation is removed at the ontological level; otherwise, the represented concept and all the axioms where it occurs are deleted from the GSO (steps 15 and 16). Properties that are not elements of the Related Property set RP are deleted together with all the axioms they occur in. The above-mentioned deletion operation on concepts and properties is represented by the function delete(x, y) in Algorithm 1; the operation is not described in detail because it is supported by well-known semantic technologies exploited for the algorithm's implementation (e.g. the Protege APIs6).

• Step 19. The ontology returned by the algorithm after the axiom removal represents the extracted module, namely the GLSO.

Observe that the module is extracted by deletion of axioms from the source ontology; as a consequence, the GLSO is still an ontology and, in particular, a subontology of the GSO; in other words, GSO ⊨ GLSO holds by the monotonicity of OWL.

6 http://protege.stanford.edu/

The size of the extracted ontology module depends on the parameters k and h, but also on the input concepts and, in particular, on their quantity and distribution within the concept hierarchy: the higher the input concepts sit in the concept hierarchy, the closer they are to each other, and the smaller the size of the module. In any case, the two parameters offer reasonable control over the resulting ontology size. Moreover, the algorithm generally preserves the upper section of a concept hierarchy, because general concepts are more likely to be mapped to concepts of other ontologies when semantic mediation needs to be performed. Finally, considering a tree whose root is the concept Thing as a neighborhood for a set of input terms preserves structural similarity between concepts as defined in [7]; structural similarity metrics have been shown to be useful in service retrieval [8].
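The traversal performed by Algorithm 1 can be sketched as follows. This is a minimal illustration on a toy hierarchy encoded as a child-to-parent dictionary, not the actual PaNCH implementation, which operates on the inferred OWL class hierarchy through the Protege APIs; all names below are illustrative.

```python
# Sketch of the PaNCH neighborhood construction (steps 3-12 of Algorithm 1)
# on a toy NCH encoded as {concept: parent}, with None marking the root.

def ancestors(nch, x):
    """Return x and all its ancestors up to the root (steps 4-5)."""
    out = []
    while x is not None:
        out.append(x)
        x = nch[x]
    return out

def children(nch, x):
    return [c for c, p in nch.items() if p == x]

def descendants(nch, x, depth):
    """Descendants of x down to the given depth, excluding x itself."""
    out = []
    if depth > 0:
        for c in children(nch, x):
            out.append(c)
            out.extend(descendants(nch, c, depth - 1))
    return out

def extract_module(nch, input_concepts, k, h):
    # Steps 3-5: strictly Related Tree (input concepts and their ancestors).
    rt = set()
    for x in input_concepts:
        rt.update(ancestors(nch, x))
    # Steps 6-7: related siblings of RT nodes not already in RT.
    rs = set()
    for x in rt:
        parent = nch[x]
        if parent is not None:
            rs.update(y for y in children(nch, parent) if y not in rt)
    # Steps 8-12: neighborhood = RT + RS, plus k-descendants of the input
    # concepts and h-descendants of the related siblings.
    cn = rt | rs
    for x in input_concepts:
        cn.update(descendants(nch, x, k))
    for x in rs:
        cn.update(descendants(nch, x, h))
    return cn
```

With k=1 and h=0 (the configuration used in Section 5.2.2), siblings are kept but their subtrees are pruned, which is what shrinks the module.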

3.3. Mapping of data and service ontologies

Mappings between the elements of the DO and the GLSO are computed by a two-step process: first, the clustering algorithm used in MOMIS to compute clusters of overlapping classes is applied to discover similar elements in the two ontologies; then, the clusters obtained are analyzed to create the mappings between the elements of each cluster. The clustering algorithm relies on an input matrix where the columns and the rows represent the source schema elements (the DO elements and the names of the GLSO concepts) and the values in the cells measure the relatedness of the schema elements in the corresponding row and column. Such relatedness is obtained by evaluating the syntactic and lexical similarity of the corresponding elements.

The syntactic similarity measures the similarity between the names used for describing the elements. Several string similarity metrics have been proposed [9], e.g., Jaccard, Hamming, Levenshtein, etc. As our approach is independent of the similarity metric selected, we leave this choice to the application.
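Two of the metrics cited above can be sketched as follows; these are textbook implementations for illustration, not the ones used in the prototype (which, as noted, leaves the choice of metric to the application).

```python
# Illustrative implementations of two of the string metrics cited above.

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over the sets of word tokens of the two names."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]
```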

Nevertheless, string similarity may fail in highly heterogeneous environments that lack a common vocabulary. In these cases, it is critically important to capture the meaning of a word and not only its syntax. For this reason, we employ a lexical similarity measure based on WordNet. We associate with each class and attribute of the DO and the GLSO a corresponding element (synset) in WordNet. Synsets are related to each other by means of different kinds of relationships (synonymy, hyponymy, hypernymy, etc.). By giving a different weight to the different relationships, it is possible to measure the similarity between two synsets. By considering the synsets associated with the classes and attributes of the DO and the GLSO, it is possible to compute their similarity. Observe that the DO classes and attributes are already annotated according to the process introduced in Section 3.1.
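The intuition behind path-based lexical similarity can be illustrated on a toy synset graph; the micro-hierarchy and the 1/(1+distance) weighting below are invented purely for this sketch (the prototype relies on the actual WordNet relations and weights).

```python
# Toy illustration of WordNet-style lexical similarity: similarity decays
# with the number of hypernym edges separating two synsets. Both the micro
# hierarchy and the 1/(1 + distance) scoring are invented for this sketch.

HYPERNYM = {            # synset -> its hypernym (None for the root)
    "entity": None,
    "structure": "entity",
    "building": "structure",
    "hotel": "building",
    "accommodation": "building",
}

def path_to_root(s):
    path = []
    while s is not None:
        path.append(s)
        s = HYPERNYM[s]
    return path

def lexical_similarity(s1, s2):
    """1 / (1 + length of the shortest hypernym path between s1 and s2)."""
    p1, p2 = path_to_root(s1), path_to_root(s2)
    common = set(p1) & set(p2)
    dist = min(p1.index(c) + p2.index(c) for c in common)
    return 1.0 / (1.0 + dist)
```

Here "hotel" and "accommodation" share the ancestor "building" one edge away on each side, so they score higher than any pair whose nearest common ancestor is the root.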

The algorithm for generating the input matrix is highly customizable, thus enabling the user to select the importance of each similarity measure according to the source structures and contents. Moreover, the user may select the clustering algorithm threshold in order to obtain large clusters (and consequently less selective associations between the DO and GLSO elements) or clusters where only strictly related elements are grouped. Mappings are then automatically generated by analyzing the result of the clustering process. The following cases are possible:

• A cluster contains only DO classes: it is not exploited for the mapping generation; such a cluster is obtained by setting a clustering threshold less selective than the one chosen in the DO creation process.

• A cluster contains only GLSO classes: it is not exploited for the mapping generation; it means that there are descriptions of Web Services which are strongly related to each other.

• A cluster contains classes belonging both to the DO and the GLSO: this cluster produces, for each DO class, a mapping to each GLSO class. Mappings between the attributes are generated on the basis of their lexical similarity.
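The three cases above can be sketched as a small routine over the cluster output; the "do:"/"glso:" prefixes are an invented convention for this illustration, used only to tell the two ontologies apart.

```python
# Sketch of the mapping generation from clusters: only mixed clusters
# produce mappings, and every DO class is mapped to every GLSO class it is
# grouped with. The "do:"/"glso:" prefixes are an invented convention.

def generate_mappings(clusters):
    mappings = []
    for cluster in clusters:
        do_elems = [e for e in cluster if e.startswith("do:")]
        glso_elems = [e for e in cluster if e.startswith("glso:")]
        # Clusters whose elements come from only one ontology are discarded.
        if not do_elems or not glso_elems:
            continue
        for d in do_elems:
            for g in glso_elems:
                mappings.append((d, g))
    return mappings
```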

Example 3.2. As an example, consider the DO described in Fig. 1, and a piece of the GLSO concerning the class Hotel and the attributes this class is the domain of; using a dotted notation in the form ‘‘concept.property’’, this piece of ontology is represented as follows:

Hotel

Hotel.Denomination

Hotel.Location

Hotel.Country

The following mappings are generated by the application of our technique:

Accommodation –> Hotel

Accommodation.Name –> Hotel.Denomination

Accommodation.City –> Hotel.Location

Accommodation.Country –> Hotel.Country

Clusters also provide another solution to the false dissimilarities problem: when two terms t1 and t2 of the GLSO are clustered together, they are mapped to the same term s of the DO; when a query formulated in the DO terminology contains the term s, both t1 and t2 will be extracted as keywords and used to retrieve relevant services.

4. Data and eService retrieval

Query processing is thus divided into two steps:

• a data set from the data sources is obtained by SQL query processing on the integrated DO view;

• a set of services related to the query is obtained by exploiting the mapping between the DO and the GLSO and the concept of relevant service mapping, by executing the IR engine.

Data results are obtained by exploiting the MOMIS Query Manager (see [10] for a complete description), which rewrites the global query as an equivalent set of queries expressed on the local schemas (local queries); this query translation is carried out by considering the mapping between the DO and the local schemas. Since MOMIS follows a global-as-view (GAV) approach, the query translation is performed by means of query unfolding. Results from the local sources are then merged by exploiting different reconciliation techniques. As query processing on an integrated view is already well described in the literature, in the following we focus our attention on the queries for services.

Services are retrieved by the XIRE (eXtended Information Retrieval Engine) component, a service search engine based on the vector space model [11]; as explained in the following subsection, the query consists of two vectors: the first is defined by keywords and is evaluated on the services' textual descriptions; the second is defined by terms of the GLSO ontology and is evaluated on the sections of the service descriptions where inputs, outputs, preconditions and effects are annotated. The process for the creation of the query vectors is represented in Fig. 4 and described in detail in the following. We now provide a brief example to give the reader the intuition of the process.

Example 4.1. Consider the query Q1 introduced in Section 2. In order to construct the above-mentioned query vectors, the terms appearing in the query are first identified by considering the elements in the SELECT, FROM and WHERE clauses of the query. Q1 is thus translated into

KDO = {Name, City, Country, Accommodation, Modena}

In this step we consider both the DO elements and the values specified in the WHERE conditions. The mappings between these DO terms and the GLSO are evaluated to expand KDO into two sets of keywords. A first set KGLSO_T extends KDO with terms of the GLSO considered as strings (URIs are not considered):

KGLSO_T = {accommodation, name, city, country, modena, hotel, denomination, location, country}

A second set KGLSO_O consists of the URIs of the ontological index terms:

KGLSO_O = {travel:Hotel, portal:denomination, travel:location, travel:country}

The set KGLSO_O is passed to XIRE to search the input, output, precondition and effect sections of the documents; the set KGLSO_T is passed to XIRE to search the service description sections of the documents. Keywords are transformed into query vectors, and services are retrieved by exploiting the vector space model.

Fig. 4. The query processing steps and their application to the reference example.


4.1. eService retrieval

Terms extraction. Given an SQL query expressed in the DO terminology, the extracted set of terms KDO consists of: all the classes given in the FROM clause, all the attributes and values used in the SELECT and WHERE clauses, and all their ranges defined by ontology classes. As an example, the set of terms extracted from the query Q1 introduced in Section 2 is the set KDO#1 represented in Fig. 4.
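A rough sketch of the extraction step follows; it uses a naive regular expression over the three clauses, and omits the ranges defined by ontology classes. A real implementation would use an SQL parser and the DO schema; the helper below is illustrative only.

```python
import re

# Naive illustration of term extraction from a DO query: collect the
# identifiers in SELECT/FROM and the attributes and quoted literal values
# in WHERE. Ranges defined by ontology classes are omitted in this sketch.

def extract_terms(sql: str) -> set[str]:
    terms = set()
    select = re.search(r"select\s+(.*?)\s+from", sql, re.I | re.S)
    frm = re.search(r"from\s+(.*?)(?:\s+where|$)", sql, re.I | re.S)
    where = re.search(r"where\s+(.*)$", sql, re.I | re.S)
    for m in (select, frm):
        if m:
            terms.update(re.findall(r"[A-Za-z_]\w*", m.group(1)))
    if where:
        # attribute names plus quoted literal values
        terms.update(re.findall(r"[A-Za-z_]\w*", where.group(1)))
        terms.update(re.findall(r"'([^']*)'", where.group(1)))
    return terms
```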

Keyword expansion. The set of terms KDO extracted from the user query is expanded into a set of keywords KGLSO by exploiting the mappings between the DO and the GLSO. Let us define a data-to-service ontology mapping function m : Sig_DO → P(Sig_GLSO), where Sig_O denotes the signature of an ontology O. This function, given a term s ∈ DO, returns the set of terms T_GLSO ⊆ Sig_GLSO such that every t ∈ T_GLSO is in the same cluster as s. Let exp(k) = {k} ∪ {x : x ∈ m(k)}; this function returns the expansion of a keyword k, which includes k itself and the terms of the GLSO mapped to it. Given a set of keywords KDO = {k0, …, km}, this set is therefore expanded into the set KGLSO = {exp(k0), …, exp(km)}.

However, in service descriptions the sections input, output, preconditions and effects contain only ontology terms represented by URIs, while the sections service name and service description contain only text. The set KGLSO is therefore split into the two sets KGLSO_T, whose keywords are represented by strings, and KGLSO_O, whose keywords are represented by URIs. By using the mappings described in Section 3.3, and assuming m(Accommodation) = Hotel, m(Name) = denomination, m(City) = location, m(Country) = country, m(Modena) = ∅ (Hotel is an ontology concept, while the other GLSO terms are ontology properties), the two sets of keywords obtained in the reference example are KGLSO_T#1 and KGLSO_O#1,7 as represented in Fig. 4.

Service retrieval. Query evaluation is based on the vector space model [11]; in this model both documents (that is, Web Service descriptions) and queries (extracted keywords) are represented as vectors in an n-dimensional space (where n is the total number of index terms extracted from the document collection). Each vector represents a document, and it has non-zero weights for those keywords which are indexes for that description. The value of such a weight is computed according to the weights of the six sections of the service description in which the keyword appears.

7 Parts of the URI specification are omitted for the sake of clarity.

We assume that the implicit constraint specified in a user query, when selecting a query term (a single keyword), is that it must appear in at least one section of a service description in order to retrieve that service. Based on this assumption, the weight which at query evaluation time is associated with a keyword and a service description is equal to the maximum of the weights of the service sections in which the keyword appears.
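The query-time weighting just described can be sketched as follows; the section names follow the six sections mentioned above, while the numeric weights are invented for this illustration (the paper does not report the actual values).

```python
# Sketch of the query-time weight: for a keyword and a service description,
# take the maximum of the weights of the sections the keyword appears in.
# The numeric section weights below are invented for this illustration.

SECTION_WEIGHTS = {
    "service_name": 1.0, "service_description": 0.8,
    "input": 0.9, "output": 0.9, "precondition": 0.6, "effect": 0.6,
}

def keyword_weight(keyword: str, service: dict[str, set[str]]) -> float:
    """service maps each section name to the set of index terms it contains."""
    weights = [SECTION_WEIGHTS[sec]
               for sec, terms in service.items() if keyword in terms]
    return max(weights, default=0.0)  # 0.0 if the keyword appears nowhere
```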

5. System architecture

To evaluate our approach we developed a framework architecture. It is composed of two main software components: an extended version of the data integration system MOMIS, and XIRE (eXtended Information Retrieval Engine), the Web Service retrieval engine. The framework is shown in Fig. 5.

5.1. MOMIS

MOMIS is a suite of software modules implementing the integration process described in Section 3.1. Its main components are:

• Schema Builder, which is in charge of generating the global virtual view of the data sources and the UVV that represents both the data sources and the services available in a peer.

• Support Tools, which are a set of components aimed at enriching the descriptions of the sources and services with metadata exploited by the Schema Builder.

Fig. 5. Prototype architecture.


• Query Manager, which is in charge of evaluating a user query, solving it with respect to the data sources and the available services.

• Wrappers, which are software modules with the role of managing the interactions with the data sources.

5.1.1. The Schema Builder

The Schema Builder is mainly composed of two modules: one in charge of creating a global virtual view of the data sources (the DO builder), and a second one computing the mappings from the global view of the data into the concepts that Web Services refer to (the UVV builder).

The DO builder implements a hierarchical clustering tool that groups the descriptions of the data sources. This component interacts with the wrappers and the Support Tools to obtain the data source descriptions and the metadata in the common thesaurus representing the relationships among the data sources. The result of the process is a DO, which is composed of a set of global classes, each one with a mapping table showing the local data source elements represented by the global class.

The UVV builder is a component that provides the mappings between the elements of the DO and the GLSO (i.e. the ontology representing the Web Services in a peer). The UVV builder implements a clustering algorithm that works on the basis of the descriptions of the GLSO elements (extracted by a specific wrapper), the UVV description and a set of semantic relationships computed by the Support Tools component. The mappings resulting from this process express one-to-one matches of DO classes and attributes into GLSO concepts and properties.

5.1.2. The Support Tools

This component contains some important modules exploited by the Schema Builder and the Query Manager at run time. In particular, the Annotation tools are modules that associate with each term in the database (table and attribute names) some metadata describing its lexical meaning with respect to a lexical reference (in our case WordNet) and some quality measures about the data (in particular, accuracy, completeness and currency). Annotations are exploited by the Schema Builder for finding similar descriptions of real-world objects in different sources, and by the Query Manager for selecting the most promising source when more than one source contains data about the same domain. The Common Thesaurus Builder is the component in charge of creating and managing a set of inter- and intra-source relationships between the available sources. Some relationships are built by exploiting the annotations; other relationships are computed by exploiting description logic techniques implemented in the ODB-Tool component.

5.1.3. The Query Manager

The MOMIS Query Manager is the component in charge of solving a user query, i.e. executing the query over the DO and extracting relevant keywords to be exploited by XIRE for retrieving the related Web Services.

Concerning the query execution on the data sources, when the component receives a query, it rewrites the global query as an equivalent set of queries expressed on the local schemas (local queries); this query translation is carried out by considering the mappings between the DO and the local schemas. MOMIS follows a global-as-view (GAV) approach, where the content of the mediated schema is expressed in terms of the local source schemas: this mapping is expressed by specifying, for each global class C, a mapping query QC over the schemas of the local classes belonging to C. The system automatically generates the mapping query QC by extending the full disjunction (FD) operator [12] and exploiting the Data Transformation Functions, which are defined by the user and represent the mapping of local attributes into DO attributes. The query translation is thus performed by means of query unfolding, i.e. by expanding a global query on a global class C of the DO according to the definition of the mapping query QC. Results from the local sources are then merged by exploiting the reconciliation techniques proposed in [13] and presented to the user [10].
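GAV unfolding can be illustrated with a toy example, where each global class is paired with a mapping query over the local sources; the mapping text below is invented for this sketch, and the full disjunction operator and data transformation functions are omitted.

```python
# Toy illustration of GAV query unfolding: each global class of the DO is
# defined by a mapping query over the local sources, and a global query is
# answered by substituting that definition in place of the global class.
# The mapping text below is invented for this sketch.

MAPPING_QUERIES = {
    "Accommodation":
        "SELECT * FROM src1.Hotel UNION SELECT * FROM src2.Inn",
}

def unfold(select_list: str, global_class: str) -> str:
    """Expand a global query on one class according to its GAV definition."""
    body = MAPPING_QUERIES[global_class]
    return f"SELECT {select_list} FROM ({body}) AS {global_class}"
```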

The relevant keywords extracted from a query are the ones identifying schema elements and searched values. By means of the mappings computed by the UVV builder, for each keyword the corresponding terms in the GLSO are extracted, if they exist. Such terms are then sent to XIRE by means of a specific wrapper.

5.1.4. The wrappers

The MOMIS wrappers translate the source data structures into ODLI3. Their role is to deal with the diversity of the data sources, thus allowing MOMIS to ignore the language details of the different data sources by representing every type of data source in the same language. Wrappers are available for different kinds of data sources, ranging from different database management systems to semi-structured data such as the XML, RDF and OWL formats.

Wrappers logically guarantee two main operations:

8 http://jena.sourceforge.net/ontology/

• getschema() translates the schema from the original format into ODLI3, dealing with the necessary data type conversions;

• runquery() executes a query on the local source. The MOMIS Query Manager translates a query on the DO (a global query) into a set of local queries to be locally executed by means of the wrappers.
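The two wrapper operations can be sketched as an abstract interface; the method names follow the text above, while the RelationalWrapper subclass and its toy in-memory source are hypothetical, introduced only to make the sketch runnable.

```python
# Sketch of the two operations every wrapper must provide. The concrete
# RelationalWrapper below is a hypothetical toy example, not part of MOMIS.
from abc import ABC, abstractmethod

class Wrapper(ABC):
    @abstractmethod
    def getschema(self) -> str:
        """Translate the source schema into ODLI3, converting data types."""

    @abstractmethod
    def runquery(self, local_query: str) -> list[tuple]:
        """Execute a (translated) local query on the wrapped source."""

class RelationalWrapper(Wrapper):
    def __init__(self, tables: dict[str, list[tuple]]):
        self.tables = tables  # toy in-memory "database"

    def getschema(self) -> str:
        # A real wrapper would emit ODLI3; here we just list table names.
        return "; ".join(sorted(self.tables))

    def runquery(self, local_query: str) -> list[tuple]:
        # Toy query language: the query is simply a table name.
        return self.tables[local_query]
```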

5.1.5. The XIRE connector

The XIRE connector is the component in charge of managing the interactions between the MOMIS and XIRE systems. The connector mainly supports two tasks:

• getGLSOschema() translates the schema of the GLSO (expressed in OWL) into ODLI3.

• serviceDiscovery() enables the discovery of the services on the basis of a query on the DO. The MOMIS Query Manager translates a query on the DO (a global query) into a set of keywords that will be exploited by XIRE to get a list of Web Services relevant to the global query.

5.2. XIRE

This component includes four modules: (i) a third-party IR engine (in the prototype we chose Lucene, an open source IR engine developed by the Apache Software Foundation [14], available at http://lucene.apache.org/), (ii) the GSO builder, which is in charge of building the GSO based on the indexes built by the IR engine, (iii) the Ontology Module Extractor, realizing the ontology modularization algorithm described in Section 3.2.3, and (iv) the term extractor, which, at query time, is in charge of extracting the set of weighted keywords from the terms of the GLSO. All these modules except Lucene are described in the next subsections.

5.2.1. GSO builder

The GSO Builder is in charge of building the global service ontology starting from the list of terms extracted from the semantic Web Service descriptions provided by the Lucene component. For each term that is also a reference to an ontological concept, the GSO Builder imports the whole ontology associated with it. In this way all related concepts are added to the GSO. The GSO Builder makes use of the Jena framework, in particular the OWL API.

5.2.2. Ontology Module Extractor

The Ontology Module Extractor is the module in charge of lightening, at set-up time, the GSO for better handling; the module implements the PaNCH algorithm described in Section 3.2.3. The algorithm has been implemented exploiting the Jena8 and Protege APIs. After having tested different configurations, the configuration used is set to k=1 and h=0; given a set of 278 input concepts, under the selected configuration 1779 concepts are deleted from an input ontology of 2927 concepts (given the same input list and ontology, 837 concepts are deleted under a configuration of k=2 and h=1, and 1578 concepts under a configuration of k=2 and h=1).


6. Experiments

In this section an evaluation of the prototype described in Section 5 is presented. In the next subsection experimental evaluations of the XIRE component are presented; concerning data integration effectiveness, the performance of the MOMIS component has been evaluated in [15]. In order to clearly describe the experiments done with the XIRE component, a preliminary consideration has to be introduced. The effectiveness of an IR-based approach is measured by means of two main measures, precision and recall, which have to be assessed with respect to a specific query. Precision is the proportion of relevant documents over the documents retrieved in response to a query; recall is the proportion of relevant documents retrieved by the system over the documents truly relevant to the considered query. The main elements of an IR experiment are then a collection of documents, a collection of queries, and, for each query, the knowledge of the documents which are truly relevant to it. As the IR-based approach to Web Service retrieval is quite new, there exists no standard collection which allows one to fully evaluate the effectiveness of such an approach. Due to the choice of OWL-S as the reference semantic description language, we used a large set of service descriptions given in the OWL-S Service Retrieval Test Collection 3 (OWLS-TC).9 In OWLS-TC [16]: several services are collected from the Web or from available repositories (OWLS-TC provides more than 1000 for the OWL-S 1.1 language), several ontologies are referred to in the service descriptions (OWLS-TC provides 43 ontologies), and these ontologies may concern different domains (OWLS-TC provides services from seven different domains). OWLS-TC is the only frequently used collection and thus also regularly cited in the literature. Notice that in the following we do not compare XIRE's results with WS matchmaker engines (see Section 7.2), due to the different a priori knowledge required. In fact, while our approach starts from a simple SQL-like query, WS matchmaker engines use complex and richer descriptions that presuppose specific knowledge about the specific service to find. As a consequence, any cross comparison between XIRE and WS matchmaker engines is ill defined. For a discussion of the different assumptions behind IR-based service retrieval and matchmaking, we refer to Section 7.2. By these experiments we want to show that, based on the analysis of the textual descriptions and of the concepts specified in a Web Service definition, the XIRE approach is able to retrieve more relevant material than that retrieved by a pure IR search engine.
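The two measures used throughout this section follow directly from their definitions and can be computed as:

```python
# Precision and recall exactly as defined above: fractions of the retrieved
# documents and of the truly relevant documents, respectively.

def precision(retrieved: set, relevant: set) -> float:
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved: set, relevant: set) -> float:
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0
```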

6.1. The test case

We selected from the OWLS-TC benchmark 18 queries related to four domains, namely Travel, Economy, Education and Food.10 Each query consists of a service description to match, and is structured into three sections: input, output and service description; while input and output consist of sets of ontology concepts, the service description is the description of the query expressed in natural language. Three data ontologies were built starting from existing resources for the above-mentioned domains: the first DO is the tourist integrated schema described in Fig. 1; the second DO represents an integrated schema of data concerning products available in a mall, such as food and different kinds of media (e.g. video, books and so on); the third DO represents an integrated schema in the education domain, concerning data about publications, universities, grants and so on. The total set of semantic Web Services considered is equal to 866 (the services of the four above-mentioned domains). Starting from the descriptions of the 18 queries expressed in natural language (service description section), we defined the corresponding SQL queries by using the terminology of the DOs. As an example, the query ‘‘Which is the best service to know about the destination for surfing’’ (query 25 of the benchmark) is translated into the SQL query:

9 http://projects.semwebcentral.org/projects/owls-tc/

10 The complete set of queries and results is available at http://siti-server01.siti.disco.unimib.it/itislab/dataserviceintegration/.

select Name

from Location inner join Activity on

where Activity.Type = ’Surfing’

We conducted our experiments on a computer equipped with a Core2 T5600 1.83 GHz, 2 GB of RAM and Windows XP Professional as operating system.

At set-up time the indexing of the set of SWSs took 672 ms; the GSO built from the terms extracted by Lucene is composed of 2927 concepts; the PaNCH algorithm is executed with an input list of 278 concepts and k=1 and h=0 as parameters, because of the large size of both the GSO and the input list; the module extraction process produced a GLSO consisting of 1148 concepts, deleting 1779 concepts of the GSO, and took 98,797 ms. The evaluation is based on two different software configurations: (i) basic search (BS) and (ii) semantic-mediated search (SM). In the former configuration we manually identify from the SQL queries the relevant keywords used as input for the IR engine, by considering the terms in the SELECT, FROM and WHERE clauses of the query (e.g. the keywords extracted from query 25 are name, location, activity, type and surfing). In particular, the keywords are used to query only the service description section of the SWS descriptions; in fact, both the input and output sections contain ontology descriptions and not simple keywords. In the second configuration we follow the whole procedure described in Section 4. For each query and for each configuration we calculate precision and recall and we draw the precision–recall graph.

In Fig. 6 the recall for each query in both configurations is reported, showing that our system based on semantic mediation (SM in the figure) produces results with better recall than the basic-search approach (BS in the figure). By analyzing the results, we can see that when we consider the whole prototype the recall is extremely high, always around 1. This is due to the fact that the approach presented in this paper enriches the set of keywords extracted from the SQL query by exploiting both the DO and the GLSO. Concerning precision, our approach produces better results by pushing the relevant items in the returned list of SWSs upwards in the rank. In Fig. 7 we report the precision–recall graph over all queries, which shows that our prototype is able to produce results with an average precision of 80% at a recall cut of 50%.

Fig. 6. Recall of all queries (series: recall SM, recall BS).

Fig. 7. Precision–recall graph (series: precision semantic mediation, precision Basic Search).

6.2. Result evaluation

Precision is increased thanks to the use of the structure of Web Service descriptions: XIRE implements a data structure which considers the input, output and description parts of each SWS description. Another observation is that, with respect to a basic IR-based approach, the use of an external ontology provided by the UVV significantly improves the results when there is terminological heterogeneity between the SQL query terms and the SWS terms. This supports the basic assumption of this work, which does not require users to express the query using the terminology of the specific SWS.

7. Related work

Issues related to the aggregation of information from data and services have been tackled from different perspectives that can be summarized in the Data as a Service and Service as Data approaches. In the first perspective, data are considered as a specific type of service [17] able to provide information by exposing some WSDL or RESTful interface [18] (RESTful services are simple Web Services implemented using HTTP and the principles of REST, Representational State Transfer). In this approach data services are used as part of complex and value-added processes realized by Service Oriented Architectures (SOA) [19,20]. In the Service as Data approach, informative services are considered as a kind of data source to be integrated with other ones, to provide a global view on the data sources managed in a networked organization. To the best of our knowledge this approach is completely new in the literature. In our vision, XIRE can be adopted as a wrapper for semantic Web Services. In fact, a generic wrapper has to achieve two main tasks: (1) exporting data source schemas in terms of the conceptual model adopted by a mediator, and (2) executing queries in the specific data source query language. XIRE deals with both issues: firstly, it is able to expose the GLSO representing the metadata description of the SWSs; secondly, it is able to discover relevant services that can be automatically invoked by using one of the techniques shown in [21–23].

Our approach may be classified as ‘‘aggregated search’’, i.e. the task of searching and assembling information from a variety of sources, placing it in a single interface [3]. Nevertheless, this is a very young research field and typically concerns data aggregation according to different criteria. Search computing [2] is another interesting discipline whose goal is to answer complex, multi-domain queries; in this research activity the need is to aggregate the results of multi-domain queries provided by domain-specific search engines. In [24] the authors propose an architecture for search computing and introduce a simple classification of services: exact services behave like relational data sources, as they return a set of unranked answers; search services return a list of answers ranked according to some measure of relevance. According to this classification, XIRE is a search service, while MOMIS is an exact service. However, an important difference between search computing and our approach is that our approach provides query answers in terms of data and related services. This is achieved by combining a data integration process with a Web Service retrieval system. In the following, we introduce some related work on these topics.

7.1. Data integration systems

The research community has been investigating data integration for about 20 years: different research communities (database, artificial intelligence, semantic web) have been developing and addressing issues related to data integration from different perspectives. Many different approaches have been proposed and a large number of projects have been produced, making it difficult to provide a classification of the previous work based on comprehensive and shared criteria.

The mediator [25] represents one of the most studied architectures for data integration, and is based on building a mediated schema as a synthesis of the source schemas to be integrated. By managing all the collected data in a common way, a mediated schema allows the user to pose a query according to a global perception of the handled information. A query over the mediated schema is translated into a set of sub-queries for the involved sources by means of automatic unfolding-rewriting operations that take into account the mediated and the source schemata. Results from the sub-queries are finally unified by data reconciliation techniques. One of the most important aspects in the design of a mediator-based system is the specification of the correspondence between the data at the sources and those in the mediated schema [26]. Two basic approaches for specifying mappings in a data integration system have been proposed in the literature: global-as-view (GAV), where the contents of the mediated schema are expressed in terms of queries over the sources, and local-as-view (LAV), based on the idea that the contents of each source are represented in terms of a predefined mediated schema. The two approaches have been fused in a unique model, called global-local-as-view (GLAV), which synthesizes the characteristics of both. Roughly speaking, the main issues of the GAV approach are related to the update of the mediated schema: if sources change, the mediated schema may change, with side effects on the applications which refer to it. On the other hand, several issues related to query rewriting have to be faced when building a query manager for LAV systems.
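As a minimal sketch of the GAV idea described above (all schemas, names and values are invented for illustration, not taken from MOMIS), each global relation can be seen as a view defined over the sources, so that a query over the mediated schema is answered by simple unfolding:

```python
# Hypothetical GAV sketch: the global relation is *defined* as a query
# (here, a Python generator) over two local sources with heterogeneous
# schemas, so a global query is answered by unfolding into source accesses.

source_a = [{"title": "DB Systems", "cost": 50}]
source_b = [{"name": "IR Basics", "price": 40}]

def global_book():
    """GAV mapping for the mediated relation Book(title, price)."""
    for r in source_a:
        yield {"title": r["title"], "price": r["cost"]}  # rename cost -> price
    for r in source_b:
        yield {"title": r["name"], "price": r["price"]}  # rename name -> title

# A selection over the mediated schema unfolds into per-source accesses.
cheap = [b["title"] for b in global_book() if b["price"] < 45]
print(cheap)  # ['IR Basics']
```

In a LAV system the direction of the definitions is reversed (sources are views over the mediated schema), which is why query answering requires rewriting rather than plain unfolding.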

Several systems have been developed following these approaches (see [27] for a survey). In this paper we exploited the MOMIS system for providing an integrated view of data sources. This choice is mainly due to the approach adopted by MOMIS, which extracts semantics from the data sources, creating a repository that is exploited in the subsequent integration of data sources with the service ontology. Moreover, the MOMIS system is one of the few data integration tools available as an open source project.11

In particular, starting from the semi-automatically generated mappings between global and local attributes stored in the mapping tables, MOMIS defines views (global classes) by means of a predefined operator, i.e. the full disjunction, which has been recognized as providing a natural semantics for data merging queries. In the view definition, resolution functions are defined to take data conflicts into account.
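To make the operator concrete, the following is a hedged sketch (not the MOMIS implementation; source contents and the resolution policy are invented) of a full disjunction over two sources sharing a join attribute, with a simple resolution function for conflicting values:

```python
# Hypothetical sketch of full disjunction: like a full outer join, every
# tuple of every source survives, and a resolution function reconciles
# conflicting attribute values (here: prefer the first non-null value).

def resolve(v1, v2):
    return v1 if v1 is not None else v2

def full_disjunction(s1, s2, key):
    keys = {r[key] for r in s1} | {r[key] for r in s2}
    result = []
    for k in keys:                          # no tuple is dropped
        r1 = next((r for r in s1 if r[key] == k), {})
        r2 = next((r for r in s2 if r[key] == k), {})
        merged = {a: resolve(r1.get(a), r2.get(a))
                  for a in set(r1) | set(r2)}
        result.append(merged)
    return result

s1 = [{"id": 1, "name": "Rossi", "city": None}]
s2 = [{"id": 1, "city": "Modena"}, {"id": 2, "city": "Milano"}]
rows = full_disjunction(s1, s2, "id")
print(sorted(rows, key=lambda r: r["id"]))
```

Note how the tuple with `id = 2`, present in only one source, is kept anyway: this is what distinguishes full disjunction from an inner join and gives the merge its natural semantics.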

Other architectures have been developed for managing several data sources in a unified manner. Among them, matching is an operation which takes two schemas as input and produces a mapping between elements of the two schemas that correspond semantically to each other

11 See http://www.datariver.it.

[28]. Thus, matching differs from mediation since it does not build a global schema. On the other hand, mediator systems exploit matching techniques for executing their tasks. Matching techniques exploit different strategies to obtain the correspondences [28,29].

The Clio project [30] pioneered the use of schema mappings, developing a tool for semi-automatically creating mappings between two data representations. In the Clio framework a source schema is mapped into a different, but fixed, ‘‘target’’ schema, while the focus of our proposal is the semi-automatic generation of the ‘‘target’’ schema, i.e. the DO, starting from the sources. Moreover, the semi-automatic tool for creating schema mappings developed in Clio employs a mapping-by-example paradigm that relies on the use of value mappings, describing how a value of a target attribute can be created from a set of values of source attributes. Our proposal for creating schema mappings can be considered orthogonal with respect to this paradigm. In fact, the main techniques of mapping construction rely on the meanings of the class and attribute names selected by the designer in the annotation phase and on the semantic relationships between meanings coming from the common lexical ontology. On the other hand, MOMIS and Clio share a common mapping semantics between a (target) global schema and a set of source schemata, expressed by the full-disjunction operator.

Another problem relevant to integration, in particular when large ontologies need to be integrated, is related to the size of the schema that can be handled. In this paper we solved this problem by adopting an automatic technique to reduce the size of the large service schema.

Several approaches have been proposed to extract modules from a given ontology. In particular, two different classes of approaches have been proposed [6]: approaches that exploit traversal-based extraction techniques considering graph-based representations of ontologies [31–34], and logic-based approaches grounded on description logics semantics [35–37]. In this paper we adopted a traversal-based extraction approach because, as discussed in [6], it is more useful when ontologies are not axiomatically rich, and when it is important to have control on the size of the resulting module. In the following we briefly discuss the main differences between the approach introduced here and the other traversal-based approaches w.r.t. this specific context, in order to motivate the introduction of the PaNCH algorithm.

As discussed in Section 3.2.3, in our context the extracted module should preserve the features of the ontology most relevant to the definition of semantic mappings, while keeping good control on the size of the module; this is achieved by keeping the more general concepts of the ontology (and their hierarchy), and a set of properties related to these concepts (used in the mappings between the GLSO and the DO); control on the size of the module is provided by adjusting k and h. More in detail, the index terms seldom occur in the bottom-most layers of the subclass hierarchies, which are often populated by concepts not relevant to the service descriptions. As for the traversal of the subclass hierarchy, we adopted a full upward traversal as in [34]; however, we dropped the full downward traversal proposed in the latter, and developed a parametric downward traversal strategy; the adopted approach is equivalent to the application of the Traversal Directive developed in [38] to the subclass relationship. However, the latter approach does not consider sibling concepts as we do; siblings are not considered in the module extraction approaches proposed in [32–34] either. Finally, paths between concepts through property traversal, as defined more or less accurately in [38,32–34], are not relevant in our case: in fact, we do not exploit reasoning to support service retrieval. Instead, our algorithm specifically extracts the properties that can be associated with the concepts in the module; to the best of our knowledge, this selective extraction is peculiar to our approach.
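The combination of a full upward traversal with a bounded downward traversal can be sketched as follows. This is a simplified illustration under assumed semantics (a toy hierarchy, a single depth parameter k), not the actual PaNCH algorithm:

```python
# Hypothetical sketch of traversal-based module extraction: keep the full
# upward closure of the index concepts along subclass edges, plus a
# parametric downward traversal bounded by depth k.

subclass_of = {            # child -> parent (toy tree-shaped hierarchy)
    "Hotel": "Accommodation",
    "Accommodation": "Thing",
    "Hostel": "Accommodation",
    "CapsuleHotel": "Hotel",
    "City": "Thing",
}

def children(c):
    return [ch for ch, p in subclass_of.items() if p == c]

def extract_module(index_terms, k):
    module = set()
    # (1) full upward traversal: every superclass of an index term is kept,
    #     preserving the general concepts and their hierarchy.
    for c in index_terms:
        while c is not None:
            module.add(c)
            c = subclass_of.get(c)
    # (2) parametric downward traversal: descend at most k levels, so the
    #     size of the module stays under control.
    frontier = set(index_terms)
    for _ in range(k):
        frontier = {ch for c in frontier for ch in children(c)}
        module |= frontier
    return module

m = extract_module({"Hotel"}, k=1)
print(sorted(m))  # ['Accommodation', 'CapsuleHotel', 'Hotel', 'Thing']
```

With k = 0 only the upward closure survives; raising k trades module size for coverage of the more specific concepts, which mirrors the role of the parameters discussed above.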

7.2. Web service retrieval

The discovery of (semantic) Web Services is an important open issue. In [39] the authors propose an IR approach to Web Service discovery, but they use the lexical database WordNet to consider the semantic meaning of words, enabling a more precise degree of similarity. Our approach, based on the use of a domain ontology, allows more accurate keyword definitions than a generic lexical database such as WordNet.

The majority of discovery solutions were developed in the context of automatic service composition. Thus, the ‘‘client’’ of the discovery procedure is an automated computer program with little, if any, tolerance to inexact results. Service matchmaking is aimed at the discovery of services satisfying a given set of requirements. Such requirements are usually specified in terms of typing of the information exchanged by the service; in other words, matchmakers are usually targeted to find services whose inputs (I), outputs (O), preconditions (P) and effects (E) match with a set of desired I/O/P/E; the original goal is to guarantee a correct exchange of messages between the selected service and the services it is supposed to interoperate with. As a result, most of the approaches are originally based on logical I/O matching strategies, which result in different logical classes of matching, namely exact, plug-in, subsume (and subsumed-by), and logical fail.
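The logical match classes can be illustrated with a toy subclass hierarchy. This is a hedged sketch (invented hierarchy, and a strong simplification of DL subsumption reasoning; the plug-in/subsume labels follow one common convention, which varies across matchmakers):

```python
# Hypothetical sketch of logical I/O matching degrees over a toy
# subclass hierarchy. Real matchmakers use DL reasoning, not this walk.

subclass_of = {"Sedan": "Car", "Car": "Vehicle"}

def subsumes(general, specific):
    """True if `specific` is (transitively) a subclass of `general`."""
    while specific is not None:
        if specific == general:
            return True
        specific = subclass_of.get(specific)
    return False

def match_degree(requested_out, offered_out):
    if requested_out == offered_out:
        return "exact"
    if subsumes(requested_out, offered_out):
        return "plug-in"   # offered output is more specific than requested
    if subsumes(offered_out, requested_out):
        return "subsume"   # offered output is more general than requested
    return "fail"

print(match_degree("Car", "Car"))        # exact
print(match_degree("Vehicle", "Sedan"))  # plug-in
print(match_degree("Sedan", "Car"))      # subsume
print(match_degree("Car", "Hotel"))      # fail
```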

Several approaches to service retrieval applicable to OWL-S service descriptions have been proposed and compared against benchmarks.12 In the following we discuss the relationship between our proposal and some matchmakers for OWL-S that implement non purely logic-based approaches, namely OWLS-MX2 [16], OWLS-MX3 [40], iMatcher2 [8], and Opossum [41].

The OWLS-MX2 and OWLS-MX3 matchmakers perform service selection based on logic-based I/O match and on non logic-based text match; text match techniques are based on syntactic similarity metrics such as the Cosine and

12 See http://www-ags.dfki.uni-sb.de/~klusch/s3/html/2009.html for a summary of the S3 contest, which provides means for evaluating the retrieval performance of semantic Web Service matchmakers.

Extended Jaccard [8]. In particular, the recent OWLS-MX3 matchmaker exploits a semantic similarity metric able to correct some false positives and negatives resulting from other matching strategies. This technique is similar to our approach to perform keyword expansion. Moreover, the OWLS-MX3 matchmaker has adaptive capabilities that make it learn (off-line) to optimally aggregate the results of different matching filters.
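The two syntactic metrics mentioned above can be sketched on simple term-frequency vectors (toy texts; real matchmakers apply them to service names, descriptions and unfolded signatures):

```python
# Sketch of the Cosine and Extended Jaccard similarity metrics
# computed on term-frequency vectors of two short texts.
import math
from collections import Counter

def tf_vector(text):
    return Counter(text.lower().split())

def cosine(v1, v2):
    dot = sum(v1[t] * v2[t] for t in v1)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def extended_jaccard(v1, v2):
    dot = sum(v1[t] * v2[t] for t in v1)
    n1 = sum(c * c for c in v1.values())
    n2 = sum(c * c for c in v2.values())
    denom = n1 + n2 - dot
    return dot / denom if denom else 0.0

q = tf_vector("book hotel room")
s = tf_vector("hotel room reservation service")
print(round(cosine(q, s), 3))           # 0.577
print(round(extended_jaccard(q, s), 3)) # 0.4
```

Both metrics score the overlap of the two term vectors; the extended Jaccard penalizes non-shared mass more heavily, which is why it is often combined with the cosine in hybrid matchmakers.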

iMatcher2 performs hybrid approximate matchmaking by evaluating a vector-based similarity of the unfolded service signatures (the path from the root concept to the I/O concepts), and text-based similarities of the unfolded service signatures, the service names, and the service descriptions. Moreover, iMatcher2 creates composed matching strategies by means of machine learning algorithms. This approach is based on the definition of queries describing the desired features of the services (name, description, I/O). It is the approach most similar to the one proposed in this paper, but it is still based on I/O decomposition or on pure syntactic TF-IDF similarity between the textual parts.

The Opossum matchmaker is based on methods for semantic indexing and approximate retrieval of Web Services. It relies on graph-based indexing, in which connected services can be approximately composed, while graph distance (shortest path distance, concept depth/average ontology depth) represents service relevance. A query interface translates a user's query, expressed as a set of keywords, into a virtual semantic Web Service, which in turn is matched against indexed services; therefore, although this is the only approach exploiting keyword-based queries, the retrieval process is still based on the separate evaluation of I/O.

In order to compare our approach to the approaches described above, consider that, because of the original matchmaking goal, all the hybrid matchmakers mentioned above are based on queries whose core consists of I/O specifications. In other words, the inputs in the query are evaluated against the inputs in the services, and the outputs in the query are evaluated against the outputs in the services, while we are interested in finding a set of services related to a set of terms extracted from a query expressed in SQL, and we do not distinguish between I/O.

Moreover, the mentioned matchmakers perform approximate retrieval of OWL-S descriptions, while our approach is explicitly based on ‘‘relevance’’-based retrieval against a set of keywords: the retrieved services might be relevant to the set of keywords w.r.t. their inputs, outputs or descriptions. Our approach radically differs from the other hybrid matchmakers, since the latter are based on matchmaking algorithms implementing IR techniques such as the TF-IDF cosine coefficient and extended Jaccard metrics, whereas our approach enriches the vector space model for indexing structured documents with semantic techniques to expand the keywords and exploit the ontology mappings.
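The effect of semantic keyword expansion can be sketched as follows. Everything here is invented for illustration (the expansion table, service texts, and the simple overlap score stand in for the ontology mappings and the full vector space model):

```python
# Hypothetical sketch of keyword expansion before vector-space retrieval:
# query terms are expanded with ontology-derived related terms, so that
# services described with a heterogeneous terminology still match.

expansion = {  # term -> related terms (stand-in for ontology mappings)
    "hotel": ["accommodation", "lodging"],
    "book": ["reserve"],
}

services = {
    "S1": "reserve a lodging in a given city",
    "S2": "weather forecast for a given city",
}

def expand(keywords):
    expanded = set(keywords)
    for k in keywords:
        expanded.update(expansion.get(k, []))
    return expanded

def score(keywords, description):
    terms = set(description.lower().split())
    return len(keywords & terms) / len(keywords)

q = expand({"book", "hotel"})
ranking = sorted(services, key=lambda s: score(q, services[s]), reverse=True)
print(ranking)  # ['S1', 'S2'] -- expansion bridges the vocabulary gap
```

Without expansion, the literal keywords ‘‘book hotel’’ share no term with S1's description; the expanded query does, which is the intuition behind the robustness to heterogeneous terminology reported in the experiments.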

7.2.1. IR models

Several works in the literature have addressed the problem of defining IR models for structured documents (e.g. [42,43]). Two main aspects have been investigated regarding the management of structured documents, i.e. how to index structured documents so as to usefully exploit their structure in their formal representation, and how to define query evaluation mechanisms that can also retrieve document subparts, thus not considering the whole document as the retrievable information unit. Passage retrieval is mainly concerned with the problem of retrieving subparts of a textual document as retrievable information units. One of the main aims of passage retrieval is to identify short blocks of relevant information amongst irrelevant text [42,44]. Another approach to the representation and retrieval of structured documents is constituted by what the authors call aggregation-based approaches. These approaches represent or estimate the relevance of document subparts based on the aggregation of the representation or estimated relevance of their own content and the representation or estimated relevance of their structurally related parts [45].
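A minimal sketch of the passage-retrieval idea, assuming a fixed-length word window and a plain term-overlap score (both are simplifications of the techniques in [42,44]):

```python
# Hypothetical sketch of passage retrieval: score fixed-length word
# windows of a document against a query and return the best window,
# instead of treating the whole document as the retrieval unit.

def best_passage(document, query, window=5):
    words = document.lower().split()
    q = set(query.lower().split())
    best, best_score = None, -1
    for i in range(max(1, len(words) - window + 1)):
        passage = words[i:i + window]
        s = len(q & set(passage))       # term overlap with the query
        if s > best_score:
            best, best_score = passage, s
    return " ".join(best), best_score

doc = ("the report opens with unrelated remarks then discusses "
       "hotel room reservation services in detail before closing")
p, s = best_passage(doc, "hotel reservation")
print(p, s)
```

The relevant block is isolated even though most of the document is irrelevant to the query, which is exactly the goal stated above.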

8. Conclusion

In complex and value-added processes, users need both to search for data and to discover interesting related services, by means of a new generation of aggregated search technology. Existing solutions for aggregated search are related only to data. In this paper we presented an original contribution in this field by proposing an integrated approach for data and service discovery. Our approach creates, at design time, an integrated representation of data sources and semantic Web Services. The user may query such an integrated view, thus obtaining a set of data and related services. Preliminary results show that our approach ensures high recall levels and acceptable levels of precision.

Future work will be devoted to increasing the system precision. In particular, techniques for structured document retrieval will be extended for improving the mappings among data and services. Moreover, we are planning to consider other benchmarks for evaluating the results. Another research direction is to push the keyword-based approach down to the data level as well, in order to offer a pure keyword-based search engine for both data and services, without using SQL as input model.

Acknowledgements

The work presented in this paper has been partially supported by the Italian FIRB project RBNE05XYPW NeP4B—Networked Peers for Business.

References

[1] S. Bergamaschi, S. Castano, M. Vincini, D. Beneventano, Semantic integration of heterogeneous information sources, Data Knowl. Eng. 36 (3) (2001) 215–249.

[2] S. Ceri, Search computing, in: ICDE, IEEE, 2009, pp. 1–3.

[3] V. Murdock, M. Lalmas, Workshop on aggregated search, SIGIR Forum 42 (2) (2008) 80–83.

[4] D. Beneventano, S. Bergamaschi, C. Sartori, Description logics for semantic query optimization in object-oriented database systems, ACM Trans. Database Syst. 28 (2003) 1–50.

[5] L. Po, S. Sorrentino, S. Bergamaschi, D. Beneventano, Lexical knowledge extraction: an effective approach to schema and ontology matching, in: Proceedings of the 10th European Conference on Knowledge Management, 3–4 September 2009, Universita Degli Studi Di Padova, Vicenza, Italy, 2009, pp. 617–626.

[6] I. Palmisano, V. Tamma, T. Payne, P. Doran, Task oriented evaluation of module extraction techniques, in: 8th International Semantic Web Conference (ISWC2009), Lecture Notes in Computer Science, vol. 5823, Springer, 2009, pp. 130–145.

[7] A. Bernstein, E. Kaufmann, C. Buerki, M. Klein, How similar is it? Towards personalized similarity measures in ontologies, in: O.K. Ferstl, E.J. Sinz, S. Eckert, T. Isselhorst (Eds.), Wirtschaftsinformatik, Physica-Verlag, 2005, pp. 1347–1366.

[8] C. Kiefer, A. Bernstein, The creation and evaluation of iSPARQL strategies for matchmaking, in: S. Bechhofer, M. Hauswirth, J. Hoffmann, M. Koubarakis (Eds.), The Semantic Web: Research and Applications, Proceedings of the 5th European Semantic Web Conference, ESWC 2008, Tenerife, Canary Islands, Spain, June 1–5, 2008, Lecture Notes in Computer Science, vol. 5021, Springer, 2008, pp. 463–477.

[9] W.W. Cohen, P.D. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: S. Kambhampati, C.A. Knoblock (Eds.), IIWeb, 2003, pp. 73–78.

[10] D. Beneventano, S. Bergamaschi, Semantic search engines based on data integration systems, in: J. Cardoso (Ed.), Semantic Web Services: Theory, Tools and Applications, Idea Group Publishing, 2007, pp. 317–342.

[11] C.D. Manning, P. Raghavan, H. Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008.

[12] C.A. Galindo-Legaria, Outerjoins as disjunctions, in: R.T. Snodgrass, M. Winslett (Eds.), Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, Minneapolis, Minnesota, May 24–27, ACM Press, 1994, pp. 348–358.

[13] F. Naumann, J.C. Freytag, U. Leser, Completeness of integrated information sources, Inf. Syst. 29 (7) (2004) 583–615.

[14] E. Hatcher, O. Gospodnetic, Lucene in Action, Manning Publications, 2004.

[15] D. Beneventano, S. Bergamaschi, M. Vincini, M. Orsini, R.C. Nana Mbinkeu, Getting through the THALIA benchmark with MOMIS, in: Proceedings of the Third International Workshop on Database Interoperability (InterDB 2007), held in conjunction with the 33rd International Conference on Very Large Data Bases, VLDB 2007, Vienna, Austria, September 24, 2007.

[16] M. Klusch, B. Fries, K.P. Sycara, OWLS-MX: a hybrid semantic web service matchmaker for OWL-S services, J. Web Sem. 7 (2) (2009) 121–133.

[17] H.-L. Truong, S. Dustdar, On analyzing and specifying concerns for data as a service, in: Proceedings of the IEEE Asian-Pacific Service Computing Conference, 2009, pp. 87–94.

[18] L. Richardson, S. Ruby, RESTful Web Services, O'Reilly, 2007.

[19] M. Hansen, S.E. Madnick, M. Siegel, Data integration using web services, in: Proceedings of the VLDB 2002 Workshop EEXTT and CAiSE 2002 Workshop DTWeb on Efficiency and Effectiveness of XML Tools and Techniques and Data Integration over the Web—Revised Papers, Springer-Verlag, London, UK, 2003, pp. 165–182.

[20] F. Zhu, M. Turner, I. Kotsiopoulos, K. Bennett, M. Russell, D. Budgen, P. Brereton, J. Keane, P. Layzell, M. Rigby, J. Xu, Dynamic data integration using web services, in: ICWS '04: Proceedings of the IEEE International Conference on Web Services, IEEE Computer Society, Washington, DC, USA, 2004, p. 262. doi: 10.1109/ICWS.2004.49.

[21] K. Sycara, M. Paolucci, A. Ankolekar, N. Srinivasan, Automated discovery, interaction and composition of semantic web services, Web Semantics: Science, Services and Agents on the World Wide Web 1 (1) (2003) 27–46. doi: 10.1016/j.websem.2003.07.002.

[22] V. De Antonellis, M. Melchiori, L. De Santis, M. Mecella, E. Mussi, B. Pernici, P. Plebani, A layered architecture for flexible web service invocation, Softw. Pract. Exp. 36 (2) (2006) 191–223. doi: 10.1002/spe.v36:2.

[23] J. Kopecky, D. Roman, M. Moran, D. Fensel, Semantic web services grounding, in: AICT-ICIW '06: Proceedings of the Advanced International Conference on Telecommunications and International Conference on Internet and Web Applications and Services, IEEE Computer Society, Washington, DC, USA, 2006, p. 127.


[24] M. Brambilla, S. Ceri, Engineering search computing applications: vision and challenges, in: H. van Vliet, V. Issarny (Eds.), ESEC/SIGSOFT FSE, ACM, 2009, pp. 365–372.

[25] G. Wiederhold, Mediators in the architecture of future information systems, IEEE Comput. 25 (3) (1992) 38–49.

[26] M. Lenzerini, Data integration: a theoretical perspective, in: L. Popa (Ed.), PODS, ACM, 2002, pp. 233–246.

[27] A.Y. Halevy, A. Rajaraman, J.J. Ordille, Data integration: the teenage years, in: U. Dayal, K.-Y. Whang, D.B. Lomet, G. Alonso, G.M. Lohman, M.L. Kersten, S.K. Cha, Y.-K. Kim (Eds.), VLDB, ACM, 2006, pp. 9–16.

[28] E. Rahm, P.A. Bernstein, A survey of approaches to automatic schema matching, VLDB J. 10 (4) (2001) 334–350.

[29] Y. Velegrakis, R.J. Miller, L. Popa, Mapping adaptation under evolving schemas, in: VLDB, 2003, pp. 584–595.

[30] R. Fagin, L.M. Haas, M.A. Hernandez, R.J. Miller, L. Popa, Y. Velegrakis, Clio: schema mapping creation and data exchange, in: A. Borgida, V.K. Chaudhri, P. Giorgini, E.S.K. Yu (Eds.), Conceptual Modeling: Foundations and Applications, Lecture Notes in Computer Science, vol. 5600, Springer, 2009, pp. 198–236.

[31] N.F. Noy, M.A. Musen, Specifying ontology views by traversal, in: S.A. McIlraith, D. Plexousakis, F. van Harmelen (Eds.), International Semantic Web Conference, Lecture Notes in Computer Science, vol. 3298, Springer, 2004, pp. 713–725.

[32] P. Doran, V. Tamma, L. Iannone, Ontology module extraction for ontology reuse: an ontology engineering perspective, in: CIKM '07: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, ACM, New York, NY, USA, 2007, pp. 61–70.

[33] M. d'Aquin, P. Doran, E. Motta, V.A.M. Tamma, Towards a parametric ontology modularization framework based on graph transformation, in: B.C. Grau, V. Honavar, A. Schlicht, F. Wolter (Eds.), WoMO, CEUR Workshop Proceedings, vol. 315, CEUR-WS.org, 2007.

[34] J. Seidenberg, A.L. Rector, Web ontology segmentation: analysis, classification and use, in: L. Carr, D.D. Roure, A. Iyengar, C.A. Goble, M. Dahlin (Eds.), WWW, ACM, 2006, pp. 13–22.

[35] B.C. Grau, I. Horrocks, Y. Kazakov, U. Sattler, Just the right amount: extracting modules from ontologies, in: Proceedings of WWW-2007: the 16th International World Wide Web Conference, Banff, Alberta, Canada, May 8–12, 2007, pp. 717–726.

[36] E. Jimenez-Ruiz, B.C. Grau, U. Sattler, T. Schneider, R.B. Llavori, Safe and economic re-use of ontologies: a logic-based methodology and tool support, in: S. Bechhofer, M. Hauswirth, J. Hoffmann, M. Koubarakis (Eds.), The Semantic Web: Research and Applications, Proceedings of the 5th European Semantic Web Conference, ESWC 2008, Tenerife, Canary Islands, Spain, June 1–5, 2008, Lecture Notes in Computer Science, vol. 5021, Springer, 2008, pp. 185–199.

[37] Z. Wang, K. Wang, R.W. Topor, J.Z. Pan, Forgetting concepts in DL-Lite, in: S. Bechhofer, M. Hauswirth, J. Hoffmann, M. Koubarakis (Eds.), The Semantic Web: Research and Applications, Proceedings of the 5th European Semantic Web Conference, ESWC 2008, Tenerife, Canary Islands, Spain, June 1–5, 2008, Lecture Notes in Computer Science, vol. 5021, Springer, 2008, pp. 245–257.

[38] N.F. Noy, Semantic integration: a survey of ontology-based approaches, SIGMOD Record 33 (4) (2004) 65–70.

[39] R. Sotolongo, C.A. Kobashikawa, F. Dong, K. Hirota, Algorithm for web service discovery based on information retrieval using WordNet and linear discriminant functions, JACIII 12 (2) (2008) 182–189.

[40] M. Klusch, P. Kapahnke, OWLS-MX3: an adaptive hybrid semantic service matchmaker for OWL-S, in: Proceedings of the Third International Workshop on Service Matchmaking and Resource Retrieval in the Semantic Web, at the 8th International Semantic Web Conference, Washington, DC, USA, CEUR Workshop Proceedings, vol. 525, CEUR-WS.org, 2009.

[41] E. Toch, A. Gal, I. Reinhartz-Berger, D. Dori, A semantic approach to approximate service retrieval, ACM Trans. Internet Technol. 8 (1).

[42] J.P. Callan, Passage-level evidence in document retrieval, in: W.B. Croft, C.J. van Rijsbergen (Eds.), SIGIR, ACM/Springer, 1994, pp. 302–310.

[43] Y. Chiaramella, Information retrieval and structured documents, in: M. Agosti, F. Crestani, G. Pasi (Eds.), ESSIR, Lecture Notes in Computer Science, vol. 1980, Springer, 2000, pp. 286–309.

[44] M. Kaszkiel, J. Zobel, Passage retrieval revisited, in: SIGIR, ACM, 1997, pp. 178–185.

[45] G. Bordogna, G. Pasi, Personalized indexing and retrieval of heterogeneous structured documents, Inf. Retrieval 8 (2) (2005) 301–318.