Semantic Search Engines

D. Beneventano and S. Bergamaschi - Semantic Search Engines based on Data Integration Systems

Semantic Search Engines based on Data Integration Systems

Chapter 13


Agenda

Semantic Search Engines : Motivation

Semantic Search Engines : Ingredients

The SEWASIE project

Architecture of the SEWASIE system

Building the SEWASIE system ontology

Querying the SEWASIE system

An architectural evolution of SEWASIE : WISDOM

Conclusion and Future Work


Motivation

Semantic Search Engines try to augment and improve traditional Web Search Engines by using not just words, but concepts and logical relationships.

Ingredients for developing Semantic Search Engines with good performance: Data Integration Systems, Domain Ontologies, and Peer-to-Peer architectures.

We will provide empirical evidence for our hypothesis: we will describe two projects, SEWASIE and WISDOM, which rely on these architectural features and have developed key semantic search functionalities.

They both exploit the MOMIS data integration system.


Ingredients for Semantic Search Engines (1)

Data Integration Systems

Data Integration: combining data residing at different autonomous sources and providing the user with a unified view of these data.

Data Integration Systems are characterized by a wrapper/mediator architecture based on a Global Virtual Schema (Global Virtual View - GVV) and a set of data sources. The data sources contain the real data, while the GVV provides a reconciled, integrated, and virtual view of the underlying sources.

Domain Ontologies

In the Semantic Web, data is associated with descriptions that have a formal semantics, defined in terms of ontologies.

Given a set of data sources related to a domain, data integration provides a GVV that is a conceptualization (domain ontology) describing the involved sources.


Ingredients for Semantic Search Engines (2)

Peer-to-Peer architectures

Schema based P2P networks : combine approaches from P2P as well as from the data integration and semantic web research areas. Such networks build upon peers that use metadata (ontologies) to describe their contents and semantic mappings among concepts of different peers’ ontologies.

Peer Data Management Systems : each node can be a data source, a mediator system, or both; a mediator node performs the semantic integration of a set of information sources to derive a global schema of the acquired information.

Super-peer networks : metadata for a small group of peers is centralized onto a single super-peer; a super-peer is a node that acts as a centralized server to a subset of clients.

Semantic overlay clustering approach : aims at creating logical layers above the physical network topology, by matching semantic information provided by peers to clusters of nodes.


MOMIS

MOMIS (Mediator envirOnment for Multiple Information Sources) is a framework to perform information extraction and integration from both structured and semistructured data sources (www.dbgroup.unimo.it/Momis).

Information integration is performed in a semi-automatic way, by exploiting the knowledge in a Common Thesaurus and descriptions of source schemas with a combination of clustering techniques and Description Logics.

An object-oriented language with an underlying Description Logic, called ODLI3, is introduced for information extraction.

The integration process gives rise to a virtual integrated view of the underlying sources: it is thus possible to synthesize a domain ontology (GVV) of a set of data sources related to a domain.

MOMIS follows a Global-As-View (GAV) approach where the GVV and the mappings among the local sources and the GVV are defined in a semi-automatic way.


SEWASIE

SEWASIE - SEmantic Webs and AgentS in Integrated Economies (www.sewasie.org) is a project funded by the EU on the Semantic Web action line (2002-2005).

In SEWASIE the schema-based and super-peer network approaches are combined into a schema-based super-peer network organized in a two-level architecture:

Peer level: a peer contains a data integration system, which integrates heterogeneous data sources into an ontology composed of an annotated GVV and Mappings to the source schemas.

Super-peer level: a super-peer contains an integration system, which integrates the GVVs of its peers into an ontology composed of a GVV of the peers' GVVs and Mappings to the GVVs of its peers.

A novel approach for defining the ontology of the super-peer and querying the peer network is introduced.

The search engine fully exploits agent technology.


WISDOM

WISDOM - Web Intelligent Search based on DOMain ontologies (www.dbgroup.unimo.it/wisdom) is an Italian MIUR-PRIN project (2004-2006).

WISDOM is based on an overlay network of semantic peers, where each peer contains a mediator-based integration system. Its key features are a distributed architecture based on the P2P paradigm and the adoption of domain ontologies.

Two levels of integration of information sources:

Lower Level - Strong integration: a semantic peer contains a data integration system, which integrates heterogeneous data sources into a domain ontology composed of an annotated GVV and Mappings to the data source schemas.

Upper Level - Loose integration: a network of peers with semantic mappings among the ontologies of a set of semantic peers.

When a query is posed against one given peer, it is suitably propagated towards other peers through the network of mappings.









The SEWASIE architecture

[Figure: the SEWASIE architecture. Components shown: Query Tool Interface and OLAP Tool; Query Agents (Query in, Results out); Brokering Agents (BA), each with a BA Ontology; Monitoring Agent (MA); SINodes, each with a Query Manager, an Ontology Builder, an Ontology, a Metadata Repository, and Wrappers over structured databases (RDBs), semi-structured sources (XML) and unstructured sources (HTML, text documents); the SEWASIE interconnection infrastructure linking the BAs.]


SEWASIE - Goal

We propose a novel approach (implemented in SEWASIE) for querying a super-peer within a schema-based super-peer network, focusing on querying a single BA.

We have two different levels of mappings:

The first mapping (m1) is at the BA level and maps several GVVs of SINodes to the GVV of the BA.

The second mapping (m2) is within an SINode and maps the data sources into the GVV of an SINode.

Query answering can be carried out in terms of two reformulation steps:

1. Reformulation w.r.t. the BA ontology (mapping m1);

2. Reformulation w.r.t. the SINode ontology (mapping m2).


The two different levels of mapping


The two-level data integration system

An Integration System IS = (GVV,N,M) is constituted by:

• A GVV, which is a schema in ODLI3, with is-a relationships and both key and foreign key constraints.

• A set N of local sources; each local source has an ODLI3 schema.

• A set M of GAV mapping assertions between GVV and N, where each assertion associates to an element g in GVV a query qN over the schemas of a set of local sources in N.

For each global class C of the GVV we define:

1. a (possibly empty) set of local classes, denoted by L(C), belonging to the local sources in N;

2. a conjunctive query qN over L(C).

A SEWASIE system is constituted by:

• A set of SINodes SN = {SN1, SN2, . . . , SNn}, where each SINode is an IS = (GVV,N,M), with N a set of data sources.

• A Brokering Agent BA, which is an IS = (GVV,N,M) where N = SN, i.e., the sources of BA are the SINodes.
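The definition above can be pictured as a small data structure. The following is only a minimal sketch: the class names `Source` and `IntegrationSystem`, the field names, and the sample schemas are assumptions made for illustration, and the mapping M is simplified to the set L(C) of each global class (the query qN is omitted).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Source:
    """A local data source with its schema (class name -> attribute names)."""
    name: str
    schema: dict[str, list[str]]


@dataclass
class IntegrationSystem:
    """IS = (GVV, N, M); M is reduced here to global class -> L(C)."""
    name: str
    gvv: dict[str, list[str]]   # global class -> global attributes
    sources: list               # N: Source objects, or SINodes for a BA
    mappings: dict[str, list[str]] = field(default_factory=dict)

    def local_classes(self, global_class: str) -> list[str]:
        """Return L(C), which may be empty."""
        return self.mappings.get(global_class, [])


# Hypothetical two-level system: two SINodes and one Brokering Agent.
db1 = Source("DB1", {"company": ["COMPANY_ID", "NAME", "ADDRESS"]})
db2 = Source("DB2", {"firm": ["COMPANY_ID", "COUNTRY_ID", "NAME"]})
sn1 = IntegrationSystem("SN1", {"company": ["COMPANY_ID", "NAME", "ADDRESS"]},
                        [db1], {"company": ["DB1.company"]})
sn2 = IntegrationSystem("SN2", {"company": ["COMPANY_ID", "NAME"]},
                        [db2], {"company": ["DB2.firm"]})
# For the BA, N = SN: its sources are the SINodes themselves.
ba = IntegrationSystem("BA", {"Company": ["COMPANY_ID", "NAME", "ADDRESS"]},
                       [sn1, sn2], {"Company": ["SN1.company", "SN2.company"]})
```

The same class serves both levels, which mirrors the point of the definition: a Brokering Agent is an integration system whose sources happen to be integration systems.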


Integration System: Semantics

SEWASIE: the GVV contains integrity constraints, and sources are considered sound (but not necessarily complete).

When the global schema contains integrity constraints, even of simple forms, the semantics of the data integration system is best described in terms of a set of databases, rather than a single one, and this implies that query processing is intimately connected to the notion of querying incomplete databases.

Traditional data integration systems (MOMIS) follow one of the following strategies: they either express the global schema as a set of plain relations without integrity constraints, or they consider the sources as exact, as opposed to sound.

[Calvanese et al - KR2004] D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. What to ask to a peer: Ontology-based query reformulation. KR 2004.









Building the SEWASIE system ontology (GVV)

MOMIS/SEWASIE allows the integration designer to semi-automatically build a GVV starting from a set of local sources, using the Ontology Builder.

The integration process exploits:

• Schema derived relationships
• Lexicon derived relationships
• Description Logics techniques for generating new relationships
• Clustering techniques for grouping similar contents of local sources

The process gives rise to a Mapping Table (MT) for each global class C of the GVV, whose columns represent the local classes L(C) belonging to C and whose rows represent the global attributes of C.

An element MT[GA][LC] represents the set of local attributes of LC which are mapped onto the global attribute GA.
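A Mapping Table can be sketched as a nested dictionary indexed first by global attribute and then by local class. The entries below are taken from the company example used in these slides (which source holds CAPITAL_STOCK and SUBCONTRACTOR is inferred from the FDAtoms shown later); the helper name `mapped_global_attributes` is hypothetical.

```python
# MT[GA][LC] -> set of local attributes of LC mapped onto global attribute GA.
# An empty set plays the role of a null element.
MT = {
    "COMPANY_ID": {"SN1.company": {"COMPANY_ID"},
                   "SN2.company": {"COMPANY_ID", "COUNTRY_ID"}},
    "ADDRESS": {"SN1.company": {"ADDRESS"}, "SN2.company": {"ADDRESS"}},
    "REGION": {"SN1.company": {"REGION"}, "SN2.company": {"REGION"}},
    "CAPITAL_STOCK": {"SN1.company": {"CAPITAL_STOCK"}, "SN2.company": set()},
    "SUBCONTRACTOR": {"SN1.company": set(), "SN2.company": {"SUBCONTRACTOR"}},
}


def mapped_global_attributes(mt, local_class):
    """Global attributes with a not null mapping in the given local class."""
    return [ga for ga, row in mt.items() if row.get(local_class)]


sn1_attrs = mapped_global_attributes(MT, "SN1.company")
sn2_attrs = mapped_global_attributes(MT, "SN2.company")
```

Reading a column of the table this way gives exactly the schema of T(L) used in the later steps.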


Overview of the GVV-generation process

[Figure: overview of the GVV-generation process. Steps shown: WRAPPING of structured sources (RDB) and semi-structured sources (XML) into ODLI3 local schemas 1…N; automatic/manual and semi-automatic ANNOTATION with synsets (WNEditor); COMMON THESAURUS GENERATION from schema derived, lexicon derived, inferred and user supplied relationships; clusters generation; GVV GENERATION producing the global schema (ODLI3) and the MAPPING TABLES.]


Example of mapping table


Building the Mappings: qN definition

GAV mappings: for each global class C of the GVV we must define a query qN over the local classes of C.

Starting from the Mapping Table of C, the integration designer, supported by the Ontology Builder graphical interface, can implicitly define qN by:

1. using and extending the Mapping Table with

• Data Conversion Functions from local to global attributes

• Join Conditions among pairs of local classes belonging to C

• Resolution Functions for global attributes to solve data conflicts of local attribute values.

2. using and extending the Full Disjunction operator, that has been recognized as providing a natural semantics for data merging queries


Building the Mappings: an example

Mapping Table of the global class company:

Global attribute    SN1.company      SN2.company
COMPANY_ID          COMPANY_ID       COMPANY_ID, COUNTRY_ID
ADDRESS             ADDRESS          ADDRESS
REGION              REGION           REGION
CAPITAL_STOCK       CAPITAL_STOCK    -
SUBCONTRACTOR       -                SUBCONTRACTOR
...

Join Conditions (join attribute COMPANY_ID): on (T_SN1.COMPANY_ID = T_SN2.COMPANY_ID)

Full Disjunction: from T_SN1 full join T_SN2 on (T_SN1.COMPANY_ID = T_SN2.COMPANY_ID)

Resolution Functions: Precedence(SN1,SN2)

Select COMPANY_ID, precedence(T_SN1.ADDRESS, T_SN2.ADDRESS) as Address, T_SN2.SUBCONTRACTOR, …


Data Conversion Functions

The designer defines how local attributes are mapped onto the global attribute GA by means of Data Conversion Functions:

for each not null element MT[GA][L], a Data Conversion Function, denoted by MTF[GA][L], is defined; it represents how the local attributes of L are mapped into the global attribute GA.

MTF[GA][L] is a function executable by the local source L. For example, for relational sources, MTF[GA][L] is an SQL value expression; the following defaults hold:

T(L) denotes L transformed by the Data Conversion Function; the schema of T(L) is composed of the global attributes GA such that MT[GA][L] is not null.
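As a rough illustration of T(L), the sketch below represents each MTF[GA][L] as a Python callable over a local row instead of an SQL value expression. The two conversions shown (concatenating COUNTRY_ID and COMPANY_ID, upper-casing REGION) are hypothetical and are not taken from the SEWASIE mappings.

```python
def transform(rows, mtf):
    """T(L): apply every not null conversion function to each local row.

    rows: list of dicts (the local class L)
    mtf:  global attribute -> callable over a local row, or None when
          MT[GA][L] is null
    """
    return [
        {ga: convert(row) for ga, convert in mtf.items() if convert is not None}
        for row in rows
    ]


# Hypothetical conversion functions for one local class.
mtf_sn2 = {
    "COMPANY_ID": lambda r: f"{r['COUNTRY_ID']}{r['COMPANY_ID']}",
    "REGION": lambda r: r["REGION"].upper(),
    "CAPITAL_STOCK": None,  # null mapping: not part of the schema of T(L)
}

local_rows = [{"COMPANY_ID": 7, "COUNTRY_ID": "IT", "REGION": "veneto"}]
t_sn2 = transform(local_rows, mtf_sn2)
```

The schema of the result contains only the global attributes with a not null mapping, as stated above.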


Join Conditions

Object Identification: merging data from different sources requires different representations of the same real world object to be identified.

Join Conditions: used to identify instances of the same object and fuse them, among pairs of local classes belonging to the same global class.

Given two local classes L1 and L2 belonging to C, a Join Condition between L1 and L2, denoted by JC(L1,L2), is an expression over L1.Ai and L2.Aj where Ai (Aj) are global attributes with a not null mapping in L1 (L2).

As an example, for BA-GVV.Company the designer can define JC(SN1.Company, SN2.Company): SN1.Company.COMPANY_ID = SN2.Company.COMPANY_ID


Resolution Functions

The fusion of data coming from different sources has to take into account the problem of inconsistent information among sources.

MOMIS/SEWASIE adopts Resolution Functions. A Resolution Function may be defined for each global attribute mapping onto local attributes coming from several sources, to solve data conflicts due to different local attribute values.

Homogeneous Attributes: if there are no data conflicts for a global attribute mapped onto more than one source, the attribute is a Homogeneous Attribute and one of the local values can simply be taken.

As an example, in BA-GVV.Company, we define all the global attributes as Homogeneous Attributes except for Address, where we use a precedence function: SN1.Company.ADDRESS has a higher precedence than SN2.Company.ADDRESS.
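A precedence function of this kind could look as follows. This is a sketch under one assumption that the slides do not state: a null (None) value from the preferred source falls back to the next source in precedence order. The sample addresses are invented.

```python
def precedence(*values):
    """Return the first non-null value; arguments are in precedence order.

    Falling back to the next source on null is an assumption of this sketch.
    """
    for value in values:
        if value is not None:
            return value
    return None


def homogeneous(*values):
    """Homogeneous attribute: no conflicts expected, any non-null value works."""
    return precedence(*values)


# SN1.Company.ADDRESS has a higher precedence than SN2.Company.ADDRESS.
address = precedence("Via Roma 1", "Via Milano 2")
fallback = precedence(None, "Via Milano 2")
```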


Full Disjunction

Full Disjunction (FD) [Galindo-Legaria - SIGMOD 1994] and [Rajaraman, Ullman - PODS 1996]: “computing the natural outer-join of many relations preserving all possible connections among facts”.

Given a global class C composed of L1,L2, ..., Ln we consider

FD(T(L1), T(L2), . . . , T(Ln)) computed on the basis of the Join Conditions

where T(L) denotes L transformed by the Data Conversion Function, i.e., the full disjunction operator is applied after data conversion.


Full Disjunction Computation (1/2)

[Rajaraman, Ullman - PODS 1996]: there is a natural outerjoin sequence producing FD if and only if the set of relation schemes forms a connected, acyclic hypergraph (with two relations, FD corresponds to the full outer join).

A Global Class C with more than 2 local classes is a cyclic hypergraph, so a new method is needed.

Moreover, we consider the requirement that qN has to contain a unique tuple merging all the tuples representing the same real world object.

Example with n = 3: the local classes L1, L2 and L3 are pairwise connected by the join conditions JC(L1,L2), JC(L1,L3) and JC(L2,L3).


Full Disjunction Computation (2/2)

The computation of FD is performed assuming:

1. each L contains a key,
2. all the join conditions are on key attributes,
3. all the join attributes are mapped into the same set of global attributes (K).

It can be demonstrated that:

1. K is a key of C, and
2. FD can be computed as (FDExpr):

(T(L1) full join T(L2) on JC(L1,L2))
full join T(L3) on (JC(L1,L3) OR JC(L2,L3))
...
full join T(Ln) on (JC(L1,Ln) OR JC(L2,Ln) OR ... OR JC(Ln-1,Ln))
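Under the three assumptions above, and taking every join condition to be equality on K, the chain of full joins in FDExpr yields one tuple per distinct value of K. The sketch below exploits this by grouping rows on the key rather than executing the joins literally; it is a simplification for illustration, not the SEWASIE implementation. The function name and sample rows are hypothetical.

```python
def full_disjunction(tables, key):
    """Merge converted local classes T(L1..Ln) on a shared key attribute.

    tables: source name -> list of row dicts, each with a unique `key` value
    Returns key value -> {source name: row}; a source that has no row for a
    key is simply absent, which corresponds to the nulls of a full join.
    """
    merged = {}
    for source, rows in tables.items():
        for row in rows:
            merged.setdefault(row[key], {})[source] = row
    return merged


t_sn1 = [{"COMPANY_ID": 1, "ADDRESS": "A1"}, {"COMPANY_ID": 2, "ADDRESS": "A2"}]
t_sn2 = [{"COMPANY_ID": 2, "ADDRESS": "B2"}, {"COMPANY_ID": 3, "ADDRESS": "B3"}]
fd = full_disjunction({"SN1": t_sn1, "SN2": t_sn2}, "COMPANY_ID")
```

Values of the same global attribute coming from different sources are kept side by side here, so that Resolution Functions can be applied afterwards.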









Querying the SEWASIE system

A SEWASIE system is a two-level data integration system:

• Mapping m1, among the SINode-GVVs and the BA-GVV;
• Mapping m2, among the source schemas and the SINode-GVV.

Halevy et al [VLDB2003] showed that, in general, the mapping from the data source schemas to the BA-GVV is not simply the composition of m1 and m2; Fagin et al [SIGMOD2004] showed that second order logic is needed to express composition.

Calvanese et al [KR2004] proved that if m1 and m2 are GAV mappings, the mapping is indeed the composition of m1 and m2; this implies that query answering can be carried out in terms of two reformulation steps:

1. w.r.t. the BA-ontology (BA-GVV + mapping m1);

2. w.r.t. the SINode-ontology (SINode-GVV + mapping m2).

These reformulation steps are similar: in the following we will discuss the reformulation w.r.t. the BA-ontology.


Query Reformulation

Query expansion ([Calvanese et al - KR2004])

The query on the BA-GVV is expanded by taking into account the constraints in the BA-GVV: all constraints in the ontology are “compiled in” the expansion, so that the expanded query (EXPQuery) can be processed by ignoring constraints. This is the first technique of this kind in the data integration literature, as all other approaches to GAV data integration are based on just unfolding (which is an incomplete technique in our case).

Subqueries (EXPAtoms) are extracted from EXPQuery. An EXPAtom is a Single Class Query, i.e., a query on a single Global Class of the BA-GVV.

Query unfolding (for single class queries)

Each EXPAtom is unfolded by considering the mappings in the BA-Ontology, so that it is rewritten w.r.t. the SINode-GVVs.

In the following we will discuss the unfolding process of an EXPAtom by taking into account the new approach to define qN.


Query unfolding

Given a global class C of the BA-GVV, with local classes L1, L2, ..., Ln, we consider a Single Global Query (SGQ) Q over C:

Q = select <Q_select-list> from C where <Q_condition>

<Q_condition> is a Boolean expression of atomic constraints: (GA1 op value) or (GA1 op GA2), where GA1 and GA2 are attributes of C.

Example:

EXPATOM = SELECT NAME, CAPITAL_STOCK, REGION, ADDRESS, SUBCONTRACTOR
FROM company
WHERE CAPITAL_STOCK > 50 AND REGION LIKE 'VENETO' AND SUBCONTRACTOR LIKE 'yes'

The output of the query unfolding process is:

1. a set of SCQs (FDAtoms) over the SINode GVVs:
FDAtom = select <select-list> from SINode.C where <condition>
where C is a Global Class of the SINode-GVV;

2. the FDExpr, which computes the Full Disjunction of the FDAtoms;

3. the resolution functions of the attributes in <select-list>.

The query unfolding process is made up of the following steps: (1) Atomic constraint mapping; (2) Select-list computation.


Query unfolding: Atomic constraint mapping

Each atomic constraint of Q is rewritten into one that can be supported by the local class.

The atomic constraint mapping is performed on the basis of the mapping functions defined in the Mapping Table, and depends on the definition of the Resolution Functions for global attributes.

Non Homogeneous Attributes: for example, if we use the AVG function as resolution function for GA, the constraint (GA = value) cannot be pushed to the local sources: since AVG has to be calculated at the global level, the constraint may be globally true but locally false. In this case, the constraint is mapped as true in the local class.

Homogeneous Attributes: an atomic constraint (GA op value) is mapped onto the local class L as follows:

(MTF[GA][L] op value) if MT[GA][L] is not null and the op operator is supported in L;
true otherwise.
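The mapping rule above can be sketched as a small function. The sketch assumes that MTF entries are SQL value expressions held as strings, that the global condition is a plain conjunction, and that each local class advertises the operators it supports; the names `map_constraint`, `local_where` and the sample data are hypothetical.

```python
def map_constraint(ga, op, value, local_class, mtf, homogeneous, supported_ops):
    """Map the atomic constraint (GA op value) onto one local class."""
    local_expr = mtf.get(ga, {}).get(local_class)
    if ga not in homogeneous or local_expr is None:
        return "true"  # non homogeneous attribute or null mapping
    if op not in supported_ops.get(local_class, set()):
        return "true"  # operator not supported by the local source
    return f"({local_expr}) {op} ({value})"


def local_where(constraints, local_class, mtf, homogeneous, supported_ops):
    """Conjunction of the mapped constraints, dropping the trivial ones."""
    mapped = [map_constraint(ga, op, v, local_class, mtf, homogeneous, supported_ops)
              for ga, op, v in constraints]
    kept = [c for c in mapped if c != "true"]
    return " and ".join(kept) if kept else "true"


# Identity conversions for the company example (an assumption of this sketch).
MTF = {
    "CAPITAL_STOCK": {"SN1.company": "CAPITAL_STOCK"},
    "REGION": {"SN1.company": "REGION", "SN2.company": "REGION"},
    "SUBCONTRACTOR": {"SN2.company": "SUBCONTRACTOR"},
    "ADDRESS": {"SN1.company": "ADDRESS", "SN2.company": "ADDRESS"},
}
HOMOGENEOUS = {"CAPITAL_STOCK", "REGION", "SUBCONTRACTOR"}
OPS = {"SN1.company": {">", "=", "like"}, "SN2.company": {">", "=", "like"}}
Q_CONDITION = [("CAPITAL_STOCK", ">", "50"), ("REGION", "like", "'VENETO'"),
               ("SUBCONTRACTOR", "like", "'yes'")]

where_sn1 = local_where(Q_CONDITION, "SN1.company", MTF, HOMOGENEOUS, OPS)
where_sn2 = local_where(Q_CONDITION, "SN2.company", MTF, HOMOGENEOUS, OPS)
```

A constraint that maps to true is not lost: it is simply not pushed down, and has to be checked after fusion at the global level.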


Query unfolding: Select-list computation

The select-list of an FDAtom over the local class L is computed by considering the union of:

1. the attributes in <Q_select-list> with a not null mapping in L;
2. the set of attributes used to express the join conditions for L;
3. the attributes in <Q_condition> with a not null mapping in L.

For example, the set of FDAtoms for EXPATOM is:

FDATOM1:
SELECT COMPANY_ID, NAME, REGION, ADDRESS, CAPITAL_STOCK
FROM SN1.company
WHERE ((CAPITAL_STOCK) > (50) and (REGION) like ('VENETO'))

FDATOM2:
SELECT COMPANY_ID, COUNTRY_ID, NAME, REGION, ADDRESS, SUBCONTRACTOR
FROM SN2.company
WHERE (REGION) like ('VENETO') and (SUBCONTRACTOR) like ('yes')
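The union described above can be sketched as follows. For simplicity the sketch assumes identity conversion functions, so that global and local attribute names coincide; the function name `select_list` and the attribute sets are illustrative, with the sets chosen to match the FDAtoms of the example.

```python
def select_list(q_select, q_condition_attrs, join_attrs, mapped):
    """Select-list of an FDAtom over one local class.

    q_select:          attributes in <Q_select-list>
    q_condition_attrs: attributes used in <Q_condition>
    join_attrs:        attributes used in the join conditions for L
    mapped:            attributes with a not null mapping in L
    The result is the union of the three sets, in a stable order.
    """
    result = []
    for attr in [*join_attrs, *q_select, *q_condition_attrs]:
        if attr in mapped and attr not in result:
            result.append(attr)
    return result


Q_SELECT = ["NAME", "CAPITAL_STOCK", "REGION", "ADDRESS", "SUBCONTRACTOR"]
Q_COND = ["CAPITAL_STOCK", "REGION", "SUBCONTRACTOR"]

fdatom1_select = select_list(
    Q_SELECT, Q_COND, ["COMPANY_ID"],
    {"COMPANY_ID", "NAME", "REGION", "ADDRESS", "CAPITAL_STOCK"})
fdatom2_select = select_list(
    Q_SELECT, Q_COND, ["COMPANY_ID", "COUNTRY_ID"],
    {"COMPANY_ID", "COUNTRY_ID", "NAME", "REGION", "ADDRESS", "SUBCONTRACTOR"})
```

The join attributes are included even though the user did not ask for them, because the Full Disjunction needs them to fuse the partial answers.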


Query unfolding: FDExpr and Resolution Functions

The FDExpr to compute the FD of FDAtom1 and FDAtom2 is:

FDATOM1 full join FDATOM2 on (FDATOM1.COMPANY_ID=FDATOM2.COMPANY_ID)

The unfolded query is then obtained by applying the related Resolution Function to each query attribute of FDExpr:

• for Homogeneous Attributes (e.g. REGION) one of the related values is taken;
• for non Homogeneous Attributes (e.g. ADDRESS) the related Resolution Function is applied.

After the query reformulation process, we have to consider query processing techniques to evaluate queries over our two-level data integration system. In the following we show the agent-based prototype developed for the SEWASIE Query Management.


SEWASIE Query Management: functional architecture

[Figure: functional architecture of SEWASIE Query Management. Components shown: End User Query Tool; Query Agent with the EXECUTION, FUSION and FINAL RESULT steps and the SEWASIE_DB; Brokering Agent with Play Maker, EXPANDER, UNFOLDER, Map Keeper and BA Ontology (BA-GVV and mapping); SINodeAgent1 and SINodeAgent2. Flows shown: Query and Result; Expanded Query (EXPQuery) and ExpAtoms; ExpAtoms Unfolding (FDExpr, FDAtoms, ResFunctions); FDAtoms and Answers to FDAtoms.]


SEWASIE Query Management: EXPANDER

The EXPANDER receives the Query and, using the BA Ontology, produces the expanded query (EXPQuery) and its ExpAtoms:

scq1: SELECT CATEGORY_ID FROM Mould_Making

scq2: SELECT NAME, COMPANY_ID, CAPITAL_STOCK, REGION, SUBCONTRACTOR, ADDRESS FROM company WHERE CAPITAL_STOCK > 50 AND REGION LIKE 'VENETO' AND SUBCONTRACTOR LIKE 'yes'

scq3: ...

EXPQuery:
SELECT r2.NAME, r2.ADDRESS, r2.NATION FROM scq1 r1, scq2 r2, scq3 r3 WHERE r1.CATEGORY_ID = r3.CATEGORY_ID AND r2.COMPANY_ID = r3.COMPANY_ID
UNION
SELECT r2.NAME, r2.ADDRESS, r2.NATION FROM scq4 r1, scq2 r2, scq3 r3 WHERE …
UNION …


SEWASIE Query Management: UNFOLDER

For each ExpAtom, the UNFOLDER produces its unfolding: FDQuery, FDAtoms and ResFunctions. For example, for

scq2: SELECT NAME, COMPANY_ID, CAPITAL_STOCK, REGION, SUBCONTRACTOR, ADDRESS FROM company WHERE CAPITAL_STOCK > 50 AND REGION LIKE 'VENETO' AND SUBCONTRACTOR LIKE 'yes'

the unfolding is:

FDAtom1: ...

FDAtom2: SELECT COMPANY_ID, COUNTRY_ID, NAME, REGION, ADDRESS, SUBCONTRACTOR FROM company WHERE ((REGION) like ('VENETO') and (SUBCONTRACTOR) like ('yes'))

Full Disjunction:
FDQuery: SELECT * FROM FDAtom1 FULL OUTER JOIN FDAtom2 ON (FDAtom1.COMPANY_ID = FDAtom2.COMPANY_ID)

Resolution Function: precedence(${SI-NMAgent2.company.ADDRESS},${SI-NMAgent1.company.ADDRESS})

The Query Agent: coordination of query processing

The Query Agent accepts the query from the End User Query Tool, interacts with both the BA and the SINode Agents, and returns the result to the End User Query Tool.


The Query Agent: EXECUTION

1. EXECUTION

For each FDAtom (parallel execution):
INPUT: FDAtom
MESSAGES: from the QA to the SINode Agent
OUTPUT: a table storing the FDAtom result in the SEWASIE_DB


The Query Agent: FUSION

2. FUSION

For each EXPAtom (parallel execution):
INPUT: FDAtoms, FDQuery, Resolution Functions
1. Execution of FDQuery (Full Disjunction of the FDAtoms)
2. Application of the Resolution Functions on the result of the previous action
OUTPUT: a view storing the EXPAtom result in the SEWASIE_DB


The Query Agent: FINAL RESULT

3. FINAL RESULT

INPUT: output of the FUSION step
1. Execution of the Expanded Query
OUTPUT: final query result view stored in the SEWASIE_DB
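The three Query Agent steps (EXECUTION, FUSION, FINAL RESULT) can be sketched end to end in plain Python. Everything here is a stand-in chosen for illustration: the SINode agents are stub callables returning canned rows, the SEWASIE_DB is replaced by in-memory dictionaries, the expanded query is a Python callable, and a null value from the preferred source is assumed to fall back to the next source.

```python
from concurrent.futures import ThreadPoolExecutor


def execute(fd_atoms, sinode_agents):
    """EXECUTION: send every FDAtom to its SINode agent in parallel."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(sinode_agents[name], atom)
                   for name, atom in fd_atoms.items()}
        return {name: future.result() for name, future in futures.items()}


def precedence(*order):
    """Resolution function: first non-null value following the source order."""
    def resolve(by_source):
        for source in order:
            if by_source.get(source) is not None:
                return by_source[source]
        return None
    return resolve


def any_value(by_source):
    """Homogeneous attribute: one of the related values is taken."""
    return next(iter(by_source.values()))


def fuse(answers, key, resolution):
    """FUSION: full disjunction on the key, then Resolution Functions."""
    merged = {}  # key value -> attribute -> {source: value}
    for source, rows in answers.items():
        for row in rows:
            for attr, value in row.items():
                merged.setdefault(row[key], {}).setdefault(attr, {})[source] = value
    return [{attr: resolution.get(attr, any_value)(by_source)
             for attr, by_source in attrs.items()}
            for _, attrs in sorted(merged.items())]


def final_result(exp_atom_views, expanded_query):
    """FINAL RESULT: evaluate the expanded query over the fused EXPAtom views."""
    return expanded_query(exp_atom_views)


# Stub agents with invented data; a real agent would run the FDAtom.
agents = {
    "SN1": lambda atom: [{"COMPANY_ID": 1, "NAME": "Acme", "ADDRESS": "Via Roma 1"}],
    "SN2": lambda atom: [
        {"COMPANY_ID": 1, "NAME": "Acme", "ADDRESS": "Via Milano 2", "SUBCONTRACTOR": "yes"},
        {"COMPANY_ID": 2, "NAME": "Beta", "ADDRESS": "Via Po 3", "SUBCONTRACTOR": "yes"},
    ],
}
answers = execute({"SN1": "FDATOM1", "SN2": "FDATOM2"}, agents)
fused = fuse(answers, "COMPANY_ID", {"ADDRESS": precedence("SN1", "SN2")})
result = final_result({"scq2": fused},
                      lambda views: [(r["NAME"], r["ADDRESS"]) for r in views["scq2"]])
```

Keeping the per-source values until the Resolution Functions run is what lets ADDRESS follow a precedence rule while the other attributes just take any available value.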


Querying SEWASIE: Interface (available at www.sewasie.org)





















WISDOM: Semantic peer and Wrappers

Every information source Sij is associated with a wrapper Wij, whose goal is to make the data access method transparent to the upper layers.

A wrapper offers a logical schema Sij against which the upper layers can pose queries.

[Figure: a semantic peer, with the layers Global Virtual View, Data source schema, Wrapper and Web source.]


WISDOM: Semantic peer network


Peer-to-Peer Mapping and Query Processing

A semantic peer-to-peer mapping, denoted Mi,j, is a relationship between the ontology Onti of the semantic peer Pi and the ontology Ontj of the semantic peer Pj.

By means of p2p mappings, a query at a peer can ideally be extended to each peer for which a mapping is defined.

It is not always convenient to propagate a query to any peer for which a mapping exists.

We associate every peer-to-peer mapping with a content summary.

Given a pair of semantic peers for which there exists a peer-to-peer mapping, the content summary associated with such a mapping provides quantitative information about the extension of the concepts in the source ontology that can be found through the mapping in the target semantic peer.


Wrapping Large Web Sites

A large number of Web sites contain highly structured regions. These sites represent rich and up-to-date information sources, which could be used to populate WISDOM semantic peers.

Several researchers have recently developed techniques to automatically infer web wrappers (programs that extract data from HTML pages). Many web sites contain large collections of structurally similar pages.

The main problems, which significantly affect the scalability of the wrapper approach, are how to identify the structured regions of the target site, and how to collect the sample pages to feed the wrapper generation process.

Given a large web site composed of thousands of interconnected pages, we aim at producing a site model that describes at the intensional level the structure of the site.

Based on such a site model we can infer a library of wrappers. The model, together with the wrappers, can then be used to continuously extract data from the target web site.


Query Processing: formulation

To ease the user in the task of formulating queries, a graphical user interface is provided that allows queries to be specified with respect to the ontology of the peer the user is connected to (“target ontology”).

Besides specifying conditions that objects have to satisfy, a user query might also include preferences.

The result of a query Q with a preference specification pref is the set of objects, reachable from the target peer by navigating its mappings, that better comply with pref.


Query rewriting and peers selection

Content summaries (CS) deal with the problem of selecting only relevant peers.

A CS is a synopsis of the source peer contents.

In the simplest form a CS includes the cardinalities, in the source peer extension, of the concepts in the target ontology. This is recursively extended to include also information on the extensions that can be found navigating the network through the source peer.

The output is a set of “ranked rewritings” R1,...,Rm for the original query Q, with rewriting R1 being reputed the “most promising” one to return relevant results.

At Web scale, [giving a complete answer to every query] is unfeasible and query execution must move to a probabilistic world of evidence accumulation and away from exact answers.
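To make the role of content summaries concrete, the sketch below ranks candidate peers using the simplest form of CS, a map from concept to cardinality. The scoring rule (the minimum cardinality over the queried concepts, so that a peer missing any concept is discarded) is an assumption made for this example; the slides do not specify how WISDOM scores rewritings. Peer names and numbers are invented.

```python
def rank_rewritings(query_concepts, content_summaries):
    """Rank peers by how promising their content summary looks for a query.

    content_summaries: peer (or mapping) id -> {concept: cardinality}
    Returns (peer, score) pairs, most promising first, dropping peers whose
    summary reports no instances for some queried concept.
    """
    def score(summary):
        # Assumed scoring rule: the scarcest queried concept bounds the answer.
        return min(summary.get(concept, 0) for concept in query_concepts)

    scored = [(peer, score(summary)) for peer, summary in content_summaries.items()]
    relevant = [(peer, s) for peer, s in scored if s > 0]
    return sorted(relevant, key=lambda pair: pair[1], reverse=True)


summaries = {
    "P2": {"Company": 120, "Region": 20},
    "P3": {"Company": 5},
    "P4": {"Company": 40, "Region": 40},
}
ranking = rank_rewritings(["Company", "Region"], summaries)
```

The first element of the ranking plays the role of R1, the rewriting reputed most promising; peers such as P3 are never contacted.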


Query Processing: execution

The approach to query execution is inspired by work on joining ranked inputs, which has been applied to databases and information retrieval systems.

We have a set of data sources, each one ranking objects according to a specific local criterion; we wish to determine the overall best objects: those objects which are ranked higher with respect to a global criterion.

In a network of peers things get more complex, and the techniques are properly extended to deal with this increased complexity.

A query posed against the GVV retrieves data from the integrated sources according to the GAV strategy:

• queries are unfolded by taking into account the view qN;
• the results of the subqueries on the local schemas are integrated and reconciled into a global answer on the basis of qN.
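One well-known way of joining ranked inputs is a threshold-style top-k algorithm in the spirit of Fagin's Threshold Algorithm. The sketch below is offered only as an illustration of the idea of combining local rankings under a global criterion; the slides do not say which algorithm WISDOM adopts. It assumes non-negative local scores, a global criterion equal to the sum of the local scores, and a score of 0 for objects missing from a list.

```python
import heapq


def threshold_top_k(ranked_lists, k):
    """Top-k objects by total score, reading the lists in rank order.

    ranked_lists: lists of (object, score), each sorted by score descending.
    Stops as soon as the k-th best total is at least the threshold, i.e. the
    best total that any object not yet seen could still reach.
    """
    lookups = [dict(lst) for lst in ranked_lists]  # random access by object
    totals = {}
    depth = max((len(lst) for lst in ranked_lists), default=0)
    for i in range(depth):
        threshold = 0
        for lst in ranked_lists:
            if i < len(lst):  # an exhausted list contributes 0
                obj, score = lst[i]
                threshold += score
                if obj not in totals:
                    totals[obj] = sum(lk.get(obj, 0) for lk in lookups)
        best = heapq.nlargest(k, totals.items(), key=lambda pair: pair[1])
        if len(best) == k and best[-1][1] >= threshold:
            return best
    return heapq.nlargest(k, totals.items(), key=lambda pair: pair[1])


source1 = [("a", 9), ("b", 8), ("c", 1)]
source2 = [("b", 7), ("c", 6), ("a", 2)]
top1 = threshold_top_k([source1, source2], 1)
top2 = threshold_top_k([source1, source2], 2)
```

In this example the best object is found after reading two entries per list, without scanning the lists to the end, which is the property that matters when the inputs come from remote peers.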


Conclusion

We discussed some ingredients for developing Semantic Search Engines based on Data Integration Systems and peer-to-peer architectures.

SEWASIE

• Techniques for building a (super-)peer ontology (GVV and Mappings)
• Techniques for querying a super-peer

WISDOM

• Semantic peer network
• Peer-to-Peer Mapping and Query Processing (content summary)
• Query rewriting and peers selection


Future Work

SEWASIE

To manage the evolution of a super-peer ontology: managing the evolution of a peer ontology which integrates a set of peers is an important feature in a peer-to-peer network, where peers can appear and disappear very frequently.

To investigate efficient query processing techniques to evaluate queries over two-level data integration systems.

WISDOM

To improve the previous proposal by providing a framework for building an ontology customized for a set of information sources and annotating them according to the built ontology.
