Semantic Data Integration

Isabel F. Cruz
University of Illinois at Chicago

I. F. Cruz, Data Mining and Exploration Middleware Workshop


Overview

• Motivation: Two Applications
• Data Heterogeneity, Interoperability, and Data Integration
• Declarative Data Integration in GIS
• Two Applications
• Conclusions

Application 1 – Presidential Elections 2000

• Build aggregates of the data for the fifty states, at different granularity levels: e.g., county, municipality, ward
• Election data is in XML
• Each state has a different structure for its election results
• Single interface to display the results in all the states

Application 2 – Land Use

• WLIS (Wisconsin Land Information System): web-based system linking data from distributed, heterogeneous data sources
• Case study: land use codes
• Sample query: “Find all the agricultural lands in Dane and Racine counties.”
• Different authorities use different land use coding systems, leading to syntactic and semantic heterogeneities

[Figure: levels of authority: FEDERAL, STATE, REGIONAL PLANNING COMMISSION, COUNTY, VILLAGE / TOWN / CITY]


Types of Heterogeneities [Bishr 99]

[Figure: axes labeled Heterogeneity and Modeling Power]

• Syntactic: different data models, different data formats; addressed by unifying models and formats
• Schematic: different schemas; addressed by schema integration
• Semantic
  – Naming heterogeneity: same entity, different names; mapping using a thesaurus
  – Cognitive heterogeneity: an entity performs multiple roles in different contexts; mapping using rules and constraints

Levels of Interoperability [Bishr 99]

[Figure: System A and System B, each drawn as a stack of levels with INTEROPERABILITY between them, ranging from system-level to semantics. Levels as listed on the slide:]

• Application Semantics
• Data Model
• DBMS
• Spatial Data Files
• Hardware & OS
• Network Protocols

Levels of Interoperability [Bishr 99]

[Figure: the System A / System B stack of levels, from system-level to semantics]

• Single ‘virtual’ global data model
• Abstraction of all underlying remote databases
• Users need not know the location and format of data
• Users need to know semantics of the data

Levels of Interoperability [Bishr 99]

[Figure: the System A / System B stack of levels, from system-level to semantics]

• Single ‘virtual’ global data model
• Abstraction of all underlying remote databases
• Users need not know the location and format of data
• Users need not know semantics of the data

Data Integration [Lenzerini 2001]

[Figure: data integration architecture. Labels: Application, Query, Mediator, Global Schema, three Wrappers each with a Local Schema, three Sources]

Data Integration [Ludäscher et al. 2001]

[Figure: data integration architecture. Labels: Application, Query, Mediator, Ontology, Global Schema, three Wrappers each with a Local Schema, three Sources]

Data Integration [Fonseca & Egenhofer 99]

[Figure: data integration architecture. Labels: Application, Query, Mediator, Ontology, three Wrappers each with a Local Ontology, three Sources]

Global-As-View (GAV) Model

• Global schema consists of views over local schemas
• Global querying is easy – sub-query unfolding
• Global schema maintenance is difficult

[Figure: Global Schema, two Local Sources, and the MAPPING between them; labels: query, query]
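
The GAV idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two in-memory sources, their field names, the parcel IDs, and the `global_parcels` view are hypothetical and do not come from the talk. The global relation is written as a view over the local sources, so a global query is answered by unfolding it into one sub-query per source.

```python
# Hypothetical local sources with different local schemas (illustrative data).
dane_source = [
    {"parcel": "D1", "Lucode": "91"},
    {"parcel": "D2", "Lucode": "35"},
]
racine_source = [
    {"id": "R1", "Tag": "811"},
    {"id": "R2", "Tag": "815"},
]


def global_parcels():
    """Global relation Parcel(parcel_id, county, code), defined as a view
    over the local sources: one sub-query per source."""
    for row in dane_source:
        yield {"parcel_id": row["parcel"], "county": "Dane", "code": row["Lucode"]}
    for row in racine_source:
        yield {"parcel_id": row["id"], "county": "Racine", "code": row["Tag"]}


def query_global(predicate):
    """Answer a global query by unfolding the view and filtering the rows."""
    return [p for p in global_parcels() if predicate(p)]


dane_only = query_global(lambda p: p["county"] == "Dane")
print([p["parcel_id"] for p in dane_only])
```

In this sketch, adding a source means editing `global_parcels` itself, which is one way to see why global schema maintenance is the hard part of GAV.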

Local-As-View (LAV) Model

• Each local source corresponds to a query over the global schema
• Global querying is difficult – inference over partial answers
• Global schema maintenance is easy

[Figure: Global Schema, two Local Sources, and the MAPPING between them; labels: query, query]

Modified LAV Model

• Limited update to the ontology
• Alternative classifications are possible
• However, currently only possible for simple cases

[Figure: Ontology: Commercial, with Retail Sales and Retail Services. Local Ontology: Commercial, with Intensive and Non-intensive.]

[Figure: Modified Ontology: Commercial, with Retail Sales, Retail Services, Intensive, and Non-intensive.]

Semantic Web

“The Web of data (and connections) with meaning in the sense that a computer program can learn enough about what the data means to process it . . .

. . . Imagine what computers can understand when there is a vast tangle of interconnected terms and data that can automatically be followed.” (Tim Berners-Lee, Weaving the Web, 1999)

Semantic Web: Architecture (Berners-Lee, http://www.w3.org/2000/Talks/1206-xml2k-tbl/)

[Figure: layered architecture. Layers as listed on the slide: Unicode, URI; XML + NS + xmlschema; RDF + rdfschema; Ontology Vocabulary; Logic; Proof; Trust. Vertical label: Digital Signature. Side labels: Self-described document, Data, Data, Rules]


Approach

• Ontology-driven
  – Reduces the problem of knowing the contents and the structure of each data source to the smaller problem of knowing how entities in such sources are mapped to the ontology
  – Can replace global schema in a restricted application domain
• Declarative rules
  – Define mappings between concepts
  – Easy to define and to maintain
  – Expressed in XML as agreements
• (Modified) Local-As-View
  – Recognizes autonomy of local sources
• Query Processing
  – Leverages standard semantic web technology

Data Integration

[Figure: data integration architecture. Labels: Application, Query, Mediator, Ontology, Agreements, three Wrappers each with a Local Ontology, three XML Sources]

Agreement Document

• XML document that acts as a wrapper layer for the underlying local data source
• Stores information about how entities in the ontology map to the entities in the local data source
• Uses XML to capture the hierarchical ordering of entities and their mappings
• Supports query operations using XPath/XSLT to hide details of how data is structured in the local data source
• Minimizes need for programmer intervention and maintenance, as it is declaratively specified

Mapping Rules

• One to one
• One to null
• Parent to children
• One to many
• Many to one

Encoded using XPath/XSLT Expression Templates in the Agreement Document

Query Processing – Step 1

[Figure: flow diagram. Labels: User Query, arguments, references, Agreement Document, XPath Expression Template, XSLT File (twice), XPath Expression]

Query Processing – Step 2

[Figure: flow diagram. Labels: XSLT File, DOM Tree, Apache XSLT Processor, XPath Expression, Selected DOM Nodes, Accessor and Aggregation Operations]


Application 1 – Presidential Elections 2000

<state name="Illinois">
  ...
  <county name="ADAMS">
    <candidate name="Gore" votes="12197"/>
    <candidate name="Bush" votes="17331"/>
    <candidate name="Nader" votes="371"/>
    <candidate name="Buchanan" votes="140"/>
    <candidate name="Browne" votes="63"/>
    <candidate name="Hagelin" votes="7"/>
    <candidate name="Phillips" votes="0"/>
    <candidate name="McReynolds" votes="0"/>
    <candidate name="Total" votes="30109"/>
  </county>
  ...
</state>
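
As a small illustration of building an aggregate at the county level, the following Python sketch sums the per-candidate votes in the ADAMS county fragment and compares the result with the reported `Total` row. It uses only the standard-library `xml.etree.ElementTree` module; the function name `county_totals` is hypothetical, and the elided sibling elements ("...") are simply left out.

```python
import xml.etree.ElementTree as ET

# County fragment copied from the slide; elided siblings ("...") are omitted.
STATE_XML = """
<state name="Illinois">
  <county name="ADAMS">
    <candidate name="Gore" votes="12197"/>
    <candidate name="Bush" votes="17331"/>
    <candidate name="Nader" votes="371"/>
    <candidate name="Buchanan" votes="140"/>
    <candidate name="Browne" votes="63"/>
    <candidate name="Hagelin" votes="7"/>
    <candidate name="Phillips" votes="0"/>
    <candidate name="McReynolds" votes="0"/>
    <candidate name="Total" votes="30109"/>
  </county>
</state>
"""


def county_totals(state_root):
    """Return {county name: (sum of candidate votes, reported Total)}."""
    totals = {}
    for county in state_root.findall("county"):
        computed = 0
        reported = None
        for cand in county.findall("candidate"):
            votes = int(cand.get("votes"))
            if cand.get("name") == "Total":
                reported = votes  # the pseudo-candidate holding the county total
            else:
                computed += votes
        totals[county.get("name")] = (computed, reported)
    return totals


root = ET.fromstring(STATE_XML)
totals = county_totals(root)
print(totals)
```

This handles only the Illinois layout shown here; other states structure their results differently, which is the heterogeneity the agreement documents are meant to hide.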

Mapping Rules

[Figure: two hierarchies side by side. Ontology: State, County, Municipality. Wisconsin State Hierarchy: State, County, Municipality, Wardgroup. Mapping labels shown between them: One-to-one mapping (twice), Children mapping]

Agreement Document Fragment

<xpath-expr>
  /state[@name='$state_name']/county[@name='$county_name']/*
</xpath-expr>
. . .
<argument>
  <name>$state_name</name>
  <value>attribute::name</value>
</argument>
<argument>
  <name>$county_name</name>
  <value>attribute::name</value>
</argument>
. . .
<operation name="getCandidateNames" aggregation="sset">
  <body>
    child::candidate/attribute::name
  </body>
</operation>
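
A rough Python sketch of how such a template might be instantiated and evaluated follows. It is not the system's implementation, which the talk describes as XSLT-based; here `string.Template` stands in for the argument substitution and `xml.etree.ElementTree` for the XPath engine. The helper names (`instantiate`, `select_nodes`, `get_candidate_names`) and the synthetic `holder` element are assumptions. The `sset` aggregation is not defined on the slide, so the sketch just returns the names as a list in document order.

```python
import xml.etree.ElementTree as ET
from string import Template

# XPath expression template copied from the agreement document fragment.
XPATH_TEMPLATE = "/state[@name='$state_name']/county[@name='$county_name']/*"

# Reduced source document using three candidates from the election slide.
STATE_XML = """
<state name="Illinois">
  <county name="ADAMS">
    <candidate name="Gore" votes="12197"/>
    <candidate name="Bush" votes="17331"/>
    <candidate name="Nader" votes="371"/>
  </county>
</state>
"""


def instantiate(template, **arguments):
    """Fill the $-placeholders of the template with query arguments."""
    return Template(template).substitute(**arguments)


def select_nodes(state_root, xpath):
    """Evaluate the instantiated expression against the source document."""
    # ElementTree rejects absolute paths on an Element, so hang the
    # document root under a synthetic parent and make the path relative.
    holder = ET.Element("holder")
    holder.append(state_root)
    return holder.findall("." + xpath)


def get_candidate_names(nodes):
    """Stand-in for the getCandidateNames operation body
    (child::candidate/attribute::name) applied to the selected county."""
    return [n.get("name") for n in nodes if n.tag == "candidate"]


xpath = instantiate(XPATH_TEMPLATE, state_name="Illinois", county_name="ADAMS")
nodes = select_nodes(ET.fromstring(STATE_XML), xpath)
names = get_candidate_names(nodes)
print(xpath)
print(names)
```

Because the template lives in the agreement document, a state with a different XML layout would need only a different template, not different query code.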


Query Interface

Application 2 – Land Use

There are 72 counties and hundreds of cities and towns in the state; each may have its own system of classifying land use codes.

[Figure: several jurisdictions, each labeled "Land Use Code"]


Heterogeneity occurs at all levels

Parcel-based example

Each highlighted parcel has its own land use classification code

Classification Semantic Issue

Query: Find all Commercial – Sales areas in Racine and Dane County

[Figure: two classification trees.
Dane County labels: Commercial; Retail Services; Retail Sales.
Racine County labels: Commercial; Retail Sales and Services; Nonintensive; Intensive; Land Under Development.]

Classification Scheme: Exhaustive Model

009 Shopping Center
010 Open Water
111 Single Family
113 Two Family
115 Multiple Family
116 Farm Unit
129 Group Quarters
140 Mobile Home
142 Mobile Home Park
190 Seasonal Residence
021 Food and Kindred
022 Textile and Mill
023 Apparel and Related
024 Lumber and Wood

Classification Scheme: Hierarchical Model

1 Urban and Developed Land
  1.01 Residential
    1.01.01 Single Family Detached or Duplex
    1.01.02 Mobile Homes Not in Parks
    1.01.03 Multi-family Dwellings
      1.01.03.01 Three Unit Multi-family
      1.01.03.02 Four Unit Multi-family
      1.01.03.03 Five or More Multi-family

Heterogeneity of Land Use Coding Systems

Classification for Cropland

Planning Authority | Attribute | Land Use Code | Description
Dane County RPC | Lucode | 91 | Cropland Pasture
Racine County (SEWRPC) | Tag | 811 | Cropland Pasture
 | | 815 | Other Agriculture
Eau Claire County | Lu1 | AA | General Agriculture
City of Madison | Lu_4_4 | 8110 | Farms

Annotations on the slide: Synonyms; Two Land Use Codes

Land Use Mapping (1)

[Figure: one-to-many mapping between the Ontology and the Dane County Hierarchy.
Ontology labels: Land Use Code; Agriculture (OA); Industrial (OI); Manufacturing (OIM); Others (OIO).
Dane County Hierarchy labels: Land Use Code; Industrial (21-39); Agriculture (91-99); codes 35, 36, 21, 22.]

Code Explanations:

35 – Scientific Instruments
36 – Miscellaneous Industrial
21 – Food and kindred
22 – Textile and mill

Land Use Mapping (2)

[Figure: many-to-one mapping between the Ontology and the Dane County Hierarchy.
Ontology labels: Land Use Code; Agriculture (OA); Industrial (OI); Manufacturing (OIM); Plastics (OIML); Rubber (OIMR).
Dane County Hierarchy labels: Land Use Code; Industrial (21-39); Agriculture (91-99); code 31.]

Agreement Document Fragment

Mapping types, with examples from Dane County:

One-to-one

<attrvalue id="OAC" mapping="one-to-one" equiv="91"/>

‘OAC’ (Agriculture – Croplands and Pasture) is directly mapped to the land use code ‘91’ (Cropland and Pasture)

One-to-null

<attrvalue id="OAWN" mapping="one-to-null"/>

‘OAWN’ (Agriculture – Woodlands – Non-forest) does not have an equivalent

One-to-many

<attrvalue id="OIO" mapping="one-to-many">
  <localvalue> 35 </localvalue>
  <localvalue> 36 </localvalue>
</attrvalue>

‘OIO’ (Industrial – Others) is equivalent to the collection of ‘35’ (Scientific Instruments) and ‘36’ (Miscellaneous Industrial)

Agreement Document Fragment

Mapping types, with examples from Dane County:

Parent-Children

<attrvalue id="OAWF" mapping="children">
  <attrvalue id="OAWFC" equiv="94"/>
  <attrvalue id="OAWFN" equiv="99"/>
</attrvalue>

‘OAWF’ (Agriculture – Woodlands – Forest) is equivalent to the collection of ‘OAWFC’ and ‘OAWFN’

‘OAWFC’ (Agriculture – Woodlands – Forest – Commercial) and ‘OAWFN’ (Agriculture – Woodlands – Forest – Noncommercial) are equivalent to ‘94’ and ‘99’ respectively

Many-to-one

<attrvalue id="OIML" mapping="many-to-one" equiv="31"/>
<attrvalue id="OIMR" mapping="many-to-one" equiv="31"/>

‘OIML’ (Industrial – Manufacturing – Plastics) and ‘OIMR’ (Industrial – Manufacturing – Rubber) are both mapped to ‘31’ (Manufacturing – Rubber and Plastics)

Dane County Parcels

Query: “Find all parcels classified as Agriculture – Woodlands – Forests”

Local Land Use Code | Description | Parcel ID(s)
94 | Agriculture – Commercial Forest | L114
99 | Agriculture – Woodlands (non-commercial forest) | L117, L119

Result: L114, L117, L119
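
The way this query resolves can be sketched in Python as follows. The `attrvalue` entries and the parcel table are taken from the slides; the `<agreement>` wrapper element and the function names (`local_codes`, `resolve`, `find_parcels`) are assumptions added for illustration, and the real system evaluates these mappings with XPath/XSLT rather than Python.

```python
import xml.etree.ElementTree as ET

# attrvalue entries copied from the Dane County agreement fragments;
# the <agreement> root element is a hypothetical wrapper.
AGREEMENT_XML = """
<agreement>
  <attrvalue id="OAC" mapping="one-to-one" equiv="91"/>
  <attrvalue id="OAWN" mapping="one-to-null"/>
  <attrvalue id="OIO" mapping="one-to-many">
    <localvalue> 35 </localvalue>
    <localvalue> 36 </localvalue>
  </attrvalue>
  <attrvalue id="OIML" mapping="many-to-one" equiv="31"/>
  <attrvalue id="OIMR" mapping="many-to-one" equiv="31"/>
  <attrvalue id="OAWF" mapping="children">
    <attrvalue id="OAWFC" equiv="94"/>
    <attrvalue id="OAWFN" equiv="99"/>
  </attrvalue>
</agreement>
"""

# Parcel table from the slide: local land use code -> parcel IDs.
PARCELS = {"94": ["L114"], "99": ["L117", "L119"]}


def local_codes(entry):
    """Return the local land use codes that one ontology entry maps to."""
    mapping = entry.get("mapping")
    if mapping == "one-to-null":
        return []
    if mapping == "one-to-many":
        return [lv.text.strip() for lv in entry.findall("localvalue")]
    if mapping == "children":
        codes = []
        for child in entry.findall("attrvalue"):
            codes.extend(local_codes(child))
        return codes
    # one-to-one, many-to-one, and child entries carry a single equiv code.
    equiv = entry.get("equiv")
    return [equiv] if equiv is not None else []


def resolve(agreement_root, ontology_id):
    """Look up an ontology ID anywhere in the agreement and resolve it."""
    for entry in agreement_root.iter("attrvalue"):
        if entry.get("id") == ontology_id:
            return local_codes(entry)
    raise KeyError(ontology_id)


def find_parcels(agreement_root, ontology_id):
    """Collect the parcels whose local code matches the ontology concept."""
    found = []
    for code in resolve(agreement_root, ontology_id):
        found.extend(PARCELS.get(code, []))
    return found


agreement = ET.fromstring(AGREEMENT_XML)
print(resolve(agreement, "OAWF"))
print(find_parcels(agreement, "OAWF"))
```

The parent concept ‘OAWF’ has no local code of its own; it is answered through its children, which is what the parent-children mapping expresses.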


Query Interface

Agreement Maker

• Visual interface for creating agreements easily
• Existing mappings displayed to the user
• Displayed list of mappings updated as user identifies more mappings


User Interface


Advantages of approach

• Neither problem requires advanced tools like rule engines
• Using just XML to represent the ontology and the mappings is enough for these applications
• Most of the processing is within the XSLT or XPath engine (including rule resolution)
• Ontology representation and query processing evolved between applications
• Can be maintained easily without programming knowledge
• Local-as-View approach did not pose problems

Summary

Data Integration for real-world GIS problems:
– Semantic connections among data
– Declarative approach
– Available standards and tools
– Building block for higher levels of integration
– Identification of the right “level of complexity”

Current/Future Work

• Identification of other types of mappings
• Ontology mapping and alignment [Cruz & Rajendran 2003]
• Integration across multiple themes
• Design of middleware components for semantic data integration
• Schematic data integration using semantic information [Cruz & Xiao 2003]
• Extension to XQuery to handle semantic data heterogeneity [Wiegand et al 2002]

Domainspace Concept in XQuery

DOMAINSPACE Area = "www.co.dane.wi.us/Dane.xml, www.co.racine.wi.us/Racine.xml"

<Result>
{
  FOR $b IN document(Area)//*
  WHERE $b/Area.LandUseCode = "cropland/pasture"
  RETURN $b
}
</Result>
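
One possible reading of this query is sketched below in Python. The slide does not show how the proposed extension resolves the semantic value, so the per-source translation here is an assumption. The attribute names and codes (Lucode 91 for Dane, Tag 811 for Racine) come from the "Classification for Cropland" table; the in-memory documents, their `<parcels>` layout, and the parcel IDs are hypothetical stand-ins for Dane.xml and Racine.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-county documents standing in for Dane.xml and Racine.xml.
DOCUMENTS = {
    "Dane": '<parcels><parcel id="d1" Lucode="91"/><parcel id="d2" Lucode="35"/></parcels>',
    "Racine": '<parcels><parcel id="r1" Tag="811"/><parcel id="r2" Tag="815"/></parcels>',
}

# Semantic term -> per-source (attribute name, local code), taken from the
# "Classification for Cropland" table for the two counties in the query.
SEMANTIC_TERMS = {
    "cropland/pasture": {
        "Dane": ("Lucode", "91"),
        "Racine": ("Tag", "811"),
    },
}


def query_domainspace(term, areas):
    """Run one semantic query over every document in the domainspace,
    translating the term into each source's own attribute and code."""
    results = []
    for area in areas:
        attribute, code = SEMANTIC_TERMS[term][area]
        root = ET.fromstring(DOCUMENTS[area])
        for parcel in root.iter("parcel"):
            if parcel.get(attribute) == code:
                results.append((area, parcel.get("id")))
    return results


matches = query_domainspace("cropland/pasture", ["Dane", "Racine"])
print(matches)
```

The caller names only the semantic value and the set of sources; the differing attribute names and codes stay inside the mapping table.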

Acknowledgments

Support provided by
– NSF Digital Government Grant
– ARDA-NIMA
– NSF ITR Grant for Context-aware Computing

At UIC: Afsheen Rajendran, Huiyong Xiao, and William Sunna.
At the U. of Wisconsin-Madison: Nancy Wiegand, Steve Ventura, Dan Patterson, and Naijun Zhou.
Axiomap Visualization Tool: Ilya Zaslavsky, Chaitan Baru.

References

• Object Interoperability for Geospatial Applications: A Case Study, I. F. Cruz and P. Calnan. "The Emerging Semantic Web," IOS Press, 2002 (see also www.semanticweb.org/SWDB).
• Handling Semantic Heterogeneities Using Declarative Agreements, I. F. Cruz, A. Rajendran, W. Sunna, and N. Wiegand. ACM GIS, 168-174, 2002.
• Querying Heterogeneous Land Use Data Over the Web, N. Wiegand, N. Zhou, I. F. Cruz, and A. Rajendran. GIScience 2002.
• Semantic Data Integration in Hierarchical Domains, I. F. Cruz and A. Rajendran. IEEE Intelligent Systems 18(2): 66-73, 2003.
• Exploring a New Approach to the Alignment of Ontologies, I. F. Cruz and A. Rajendran. Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data, 2003.
• Using a Layered Approach for Interoperability on the Semantic Web, I. F. Cruz and H. Xiao. WISE, 2003.