A Graph-Based Approach to Learn Semantic Descriptions of Data Sources


Mohsen Taheriyan

Craig Knoblock

Pedro Szekely

Jose Luis Ambite

Problem: How to learn semantic descriptions?

First, what is a semantic description?

Semantic Description

Describing the source in terms of the concepts and relationships defined by the domain ontology

[Figure: Domain Ontology. Classes: Person, Organization, Place, City, State, Event. Properties: name, birthdate, phone, bornIn, livesIn, worksFor, ceo, location, organizer, nearby, state, isPartOf, postalCode, title, startDate, endDate. Legend: object property, data property, subClassOf.]

Source

Column 1         Column 2  Column 3   Column 4      Column 5
Bill Gates       Oct 1955  Microsoft  Seattle       WA
Mark Zuckerberg  May 1984  Facebook   White Plains  NY
Larry Page       Mar 1973  Google     East Lansing  MI

Semantic Types

Column 1          Column 2           Column 3            Column 4       Column 5
Bill Gates        Oct 1955           Microsoft           Seattle        WA
Mark Zuckerberg   May 1984           Facebook            White Plains   NY
Larry Page        Mar 1973           Google              East Lansing   MI

Person.name       Person.birthdate   Organization.name   City.name      State.name

Relationships

[Figure: the same source table, with its columns attached to ontology classes and the classes linked to each other.]

Classes: Person, Organization, City, State
Data properties: Person.name (Column 1), Person.birthdate (Column 2), Organization.name (Column 3), City.name (Column 4), State.name (Column 5)
Relationships: Person bornIn City, Person worksFor Organization, City state State

This semantic model is converted to a semantic description in R2RML
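One way to picture the semantic model on this slide is as a small set of labeled links. The sketch below is a hypothetical in-memory representation in Python; the dictionary layout, the column keys, and the link directions (read off the figure labels) are assumptions for illustration, not Karma's internal format and not R2RML syntax.

```python
# Hypothetical in-memory form of the semantic model for the example source.
# Column -> (ontology class, data property); names follow the slide.
semantic_types = {
    "Column 1": ("Person", "name"),
    "Column 2": ("Person", "birthdate"),
    "Column 3": ("Organization", "name"),
    "Column 4": ("City", "name"),
    "Column 5": ("State", "name"),
}

# Object-property links between classes: (source class, property, target class).
# The directions are an assumption based on the figure labels.
relationships = [
    ("Person", "bornIn", "City"),
    ("Person", "worksFor", "Organization"),
    ("City", "state", "State"),
]


def classes_used(types, links):
    """Return every ontology class mentioned by the model."""
    found = {cls for cls, _ in types.values()}
    for source, _, target in links:
        found.update((source, target))
    return found


print(sorted(classes_used(semantic_types, relationships)))
```

Keeping semantic types and relationships separate mirrors the two steps shown on the slides: types are assigned per column first, relationships between the resulting classes second.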

Previous approach to learn semantic descriptions

Karma

Inputs: Domain Ontology, Sample Data
Steps: Learn Semantic Types (CRF), then Extract Relationships (Steiner Tree)
Output: Semantic Model

http://www.isi.edu/integration/karma @KarmaSemWeb

Refining The Model

Initial Model

Refining The Model

Refined Model

• Previous work does not learn from the changes the user makes to relationships

• The user has to go through the refinement process each time

Our new approach to learn semantic descriptions

Key Idea

• Sources in the same domain often have similar data

• Exploit knowledge of existing source models

• Leverage relationships in known source models to hypothesize relationships for new sources

Approach

Inputs: Domain Ontology, Known Source Models (S1, S2, …, Sn), New Source
Steps: Construct Graph G, Learn Semantic Types (CRF), Generate Candidate Models, Rank Results

Example

Domain Ontology

Known Source Models

S1 = personalInfo(name, city, birthdate, state, workplace)
  Classes: Person, Organization, City, State
  Links: bornIn, worksFor, state
  Data properties: Person.name, Person.birthdate, Organization.name, City.name, State.name

S2 = getCities(state, city)
  Classes: City, State
  Links: state
  Data properties: City.name, State.name

S3 = businessInfo(company, city, ceo, state)
  Classes: Person, Organization, City, State
  Links: ceo, location, isPartOf
  Data properties: Organization.name, Person.name, City.name, State.name

New Source

S4 = postalCodeLookup(zipcode, city, state)

Build a Graph from Known Models

• Create a component in G for each known source model
  – Only add if the model is not subgraph of an existing component
• Annotate links with list of supporting models

[Figure: Component 1, built from S1 = personalInfo. Nodes: Person, Organization, City, State, Person.name, Person.birthdate, Org.name, City.name, State.name. All eight links are annotated {s1}.]

Build a Graph from Known Models

[Figure: Component 1 after adding S2 = getCities. No new component is created; the three links that S2 shares with S1 (state and the two name links of City and State) are now annotated {s1,s2}, and the remaining links stay {s1}.]

Build a Graph from Known Models

[Figure: Component 1 (from S1 and S2) next to Component 2, built from S3 = businessInfo. Component 2 has the nodes Person, Organization, City, State, Org.name, Person.name, City.name, State.name and the links ceo, location, isPartOf, and four name links, all annotated {s3}.]
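The component-building rule on these slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: each model is a set of hypothetical (source, label, target) triples transcribed from the example, link directions are assumed, and the subgraph test is a plain subset check on those triples rather than a general graph-matching procedure.

```python
# Known source models as sets of (source, label, target) links.
# Triples and directions are assumptions transcribed from the example slides.
S1 = {
    ("Person", "bornIn", "City"),
    ("Person", "worksFor", "Organization"),
    ("City", "state", "State"),
    ("Person", "name", "Person.name"),
    ("Person", "birthdate", "Person.birthdate"),
    ("Organization", "name", "Org.name"),
    ("City", "name", "City.name"),
    ("State", "name", "State.name"),
}
S2 = {
    ("City", "state", "State"),
    ("City", "name", "City.name"),
    ("State", "name", "State.name"),
}
S3 = {
    ("Organization", "ceo", "Person"),
    ("Organization", "location", "City"),
    ("City", "isPartOf", "State"),
    ("Organization", "name", "Org.name"),
    ("Person", "name", "Person.name"),
    ("City", "name", "City.name"),
    ("State", "name", "State.name"),
}


def build_components(models):
    """Return a list of components; each maps a link to its supporting models."""
    components = []
    for name, links in models:
        # Look for an existing component that already contains every link.
        host = next((c for c in components if links.issubset(c)), None)
        if host is None:
            # Not a subgraph of anything seen so far: start a new component.
            components.append({link: {name} for link in links})
        else:
            # Subgraph of an existing component: only record the extra support.
            for link in links:
                host[link].add(name)
    return components


components = build_components([("s1", S1), ("s2", S2), ("s3", S3)])
for index, component in enumerate(components, start=1):
    print("Component", index, "has", len(component), "links")
```

With these inputs S2 is absorbed into the component created for S1, while S3 introduces links such as ceo that Component 1 lacks and therefore becomes a second component.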

Build a Graph from Known Models

• Connect graph components using all paths inferred from the ontology

[Figure: Components 1 and 2 connected by links inferred from the ontology. The added links are labeled location, organizer, ceo, worksFor, and isPartOf, and the Event and Place classes appear as new nodes.]

Build a Graph from Known Models

• Assign low weight = ε to links within a component (black links)
• Weight other links according to how often they occur in the known source models (green links)

$M$ = known source models
$W_{max}$ = number of links in $M$ ($\geq |E_G|$) = 18
$c_1(e)$ = number of links in $M$ whose <label, source, target> match $e$
$c_2(e)$ = number of links in $M$ whose <label> matches $e$
$w_e = \min\left(W_{max} - c_1(e),\ W_{max} - \frac{c_2(e)}{W_{max}}\right)$

[Figure: graph G with weights on the links inferred from the ontology. The values shown are 18, 17, and 17.94.]
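The weight formula on this slide can be checked with a short Python sketch. The link triples below are a hypothetical transcription of the 18 links in S1, S2, and S3 (directions assumed), and the three query links at the end are illustrative examples rather than links confirmed to be in the ontology. Links inside a component get the weight ε and are not handled here.

```python
# All links of the known source models M, one entry per model occurrence.
# Triples are (source, label, target); directions are assumptions.
KNOWN_LINKS = [
    # S1 = personalInfo (8 links)
    ("Person", "bornIn", "City"),
    ("Person", "worksFor", "Organization"),
    ("City", "state", "State"),
    ("Person", "name", "Person.name"),
    ("Person", "birthdate", "Person.birthdate"),
    ("Organization", "name", "Org.name"),
    ("City", "name", "City.name"),
    ("State", "name", "State.name"),
    # S2 = getCities (3 links)
    ("City", "state", "State"),
    ("City", "name", "City.name"),
    ("State", "name", "State.name"),
    # S3 = businessInfo (7 links)
    ("Organization", "ceo", "Person"),
    ("Organization", "location", "City"),
    ("City", "isPartOf", "State"),
    ("Organization", "name", "Org.name"),
    ("Person", "name", "Person.name"),
    ("City", "name", "City.name"),
    ("State", "name", "State.name"),
]


def link_weight(link, known_links):
    """w_e = min(W_max - c1(e), W_max - c2(e) / W_max)."""
    w_max = len(known_links)
    # c1: known links matching label, source, and target.
    c1 = sum(1 for known in known_links if known == link)
    # c2: known links matching the label only.
    c2 = sum(1 for known in known_links if known[1] == link[1])
    return min(w_max - c1, w_max - c2 / w_max)


# Full triple seen once in M: c1 = 1.
print(link_weight(("Person", "worksFor", "Organization"), KNOWN_LINKS))
# Only the label isPartOf is seen in M: c1 = 0, c2 = 1.
print(round(link_weight(("City", "isPartOf", "Place"), KNOWN_LINKS), 2))
# Neither triple nor label seen in M: c1 = c2 = 0.
print(link_weight(("Event", "organizer", "Person"), KNOWN_LINKS))
```

The three cases reproduce the three values visible in the figure (17, 17.94, and 18): a full match lowers the weight by a whole unit, while a label-only match lowers it by just 1/18.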

Learn Semantic Types (Previous Work)

• A CRF-based model to assign a Semantic Type to each column from its data
• Semantic Type
  – Ontology Class
  – Data Property + Domain

S4 = postalCodeLookup(zipcode, city, state)

zipcode: Place.postalCode
city: City.name
state: State.name

Generate Candidate Models

• Map learned semantic types to nodes in graph G
  – There might be multiple mappings
• Compute Steiner tree (minimal tree) for each mapping

S4 = postalCodeLookup(zipcode, city, state)
Semantic types: Place.postalCode, City.name, State.name

[Figure: weighted graph G, consisting of Components 1 and 2 plus the links inferred from the ontology.]

[Figures: Mappings 1 to 4 for S4 = postalCodeLookup(zipcode, city, state). A Place.postalCode node is attached to Place through a postalCode link. For each mapping, one slide highlights the nodes of G chosen for Place.postalCode, City.name, and State.name, and the following slide shows the Steiner tree computed for that mapping.]
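The step "map learned semantic types to nodes in graph G" can be sketched as a Cartesian product over the nodes that carry each semantic type. The node identifiers below are hypothetical; the sketch assumes that City.name and State.name each occur once in Component 1 and once in Component 2, and that Place.postalCode occurs once. The Steiner tree computation for each mapping is not shown.

```python
from itertools import product

# Nodes of G that carry each semantic type; the node ids are hypothetical.
NODES_BY_TYPE = {
    "Place.postalCode": ["Place.postalCode@new"],
    "City.name": ["City.name@component1", "City.name@component2"],
    "State.name": ["State.name@component1", "State.name@component2"],
}


def enumerate_mappings(semantic_types, nodes_by_type):
    """Return every way of assigning one graph node to each semantic type."""
    choices = [nodes_by_type[semantic_type] for semantic_type in semantic_types]
    return [dict(zip(semantic_types, combo)) for combo in product(*choices)]


mappings = enumerate_mappings(
    ["Place.postalCode", "City.name", "State.name"], NODES_BY_TYPE
)
for number, mapping in enumerate(mappings, start=1):
    print("Mapping", number, mapping)
```

Under these assumptions the product has 1 × 2 × 2 = 4 elements, which agrees with the four mappings shown in the figures. The numbering printed here follows `itertools.product` order and need not match the slide numbering.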

Rank Source Models

• Rank the candidates based on:
  – Cost: sum of the weights
  – Coherence: prefer the models with higher number of supporting models

[Figure: the four candidate models for S4. Each connects Place.postalCode, City.name, and State.name through the classes Place, City, and State, using postalCode, name, and state or isPartOf links. Links are annotated with their supporting models, {s1,s2} or {s3}.]

Rank 1: Candidate 1
Rank 2: Candidate 4
Rank 3: Candidate 2
Rank 3: Candidate 3

Evaluation

• Dataset 1
  – 17 data sources containing overlapping data
  – Semantic descriptions created manually using DBpedia, FOAF, GeoNames, and WGS84 ontologies
• Dataset 2
  – 6 museum sources
  – Semantic descriptions created by domain experts using EDM, SKOS, and FOAF ontologies
• Learned a source model assuming other models as input
• Computed the Graph Edit Distance (GED) between the learned model and the correct one
  – Operations: node insertion, node deletion, edge insertion, edge deletion, edge relabeling
• Compared the results with our previous work in Karma

Results - Dataset 1

Source Signature | #Attributes | GED, Previous work | GED, New Approach (Rank 1)
nearestCity(lat, lng, city, state, country) | 5 | 6 | 1
findRestaurant(zipcode, restaurantName, phone, address) | 4 | 1 | 0
zipcodesInCity(city, state, postalCode) | 3 | 3 | 1
parseAddress(address, city, state, zipcode, country) | 5 | 6 | 1
citiesOfState(state, city) | 2 | 1 | 0
ocean(lat, lng, name) | 3 | 2 | 1
postalCodeLookup(zipCode, city, state, country) | 4 | 6 | 1
country(lat, lng, code, name) | 4 | 2 | 0
companyCEO(company, name) | 2 | 1 | 0
personalInfo(firstname, lastname, birthdate, brithCity, birthCountry) | 5 | 4 | 1
businessInfo(company, phone, homepage, city, country, name) | 6 | 10 | 8
restaurantChef(restaurant, firstname, lastname) | 3 | 2 | 1
findSchool(city, state, name, code, homepage, ranking, dean) | 7 | 8 | 6
employees(organization, firstname, lastname, birthdate) | 4 | 1 | 2
education(person, hometown, homecountry, school, city, country) | 6 | 9 | 4
administrativeDistrict(city, province, country) | 3 | 4 | 1
capital(country, city) | 2 | 2 | 1
TOTAL | 68 | 68 | 29

57% improvement

Results - Dataset 2

Source Signature | #Attributes | GED, Previous work | GED, New Approach (Rank 1)
S1(Attribution, BeginDate, EndDate, Title, Dated, Medium, Dimensions) | 7 | 1 | 0
S2(ObjectID, ObjectTitle, ObjectWorkType, ArtistName, ArtistBirthDate, ArtistDeathDate, ObjectEarliestDate, ObjectRights, ObjectFacetValue1) | 8 | 2 | 3
S3(death, birth, name) | 3 | 0 | 0
S4(accessionNumber, artist, creditLine, dimensions, imageURL, materials, relatedArtworksURL, creationDate, provenance, keywordValues) | 10 | 9 | 6
S5(AccessionNumber, Classification, CreditLine, Date, Description, DimensionsOrphan, WhatValues, Who, image, relatedArtworksValues) | 10 | 9 | 5
S6(Artist, ArtistBornDate, ArtistDiedDate, Classification, Copyright, CreditLine, Image, KeywordValues, Ref, SitterValues) | 10 | 8 | 6
TOTAL | 48 | 29 | 20

31% improvement
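The improvement figures are consistent with a relative reduction in total GED. The slides do not state how the percentage was computed, so the formula below is an assumption checked against the reported totals (68 to 29 for Dataset 1, 29 to 20 for Dataset 2).

```python
def improvement(previous_ged, new_ged):
    """Relative reduction in total graph edit distance, in percent."""
    return 100 * (previous_ged - new_ged) / previous_ged


# Totals taken from the two results tables.
print(round(improvement(68, 29)))  # Dataset 1 -> 57
print(round(improvement(29, 20)))  # Dataset 2 -> 31
```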

Related Work

• Writing semantic descriptions by hand
  – R2RML, SWRL
  – Tedious and time-consuming task
  – Requires expertise in SW technologies
• Semantic annotation of Web services and Web tables
  – Very limited in learning the relationships
• Learning Semantic Definitions of Online Information Sources [Carman, Knoblock, 2007]
  – Learns LAV rules from known sources
  – Can only learn descriptions that overlap known sources

Discussion

• Automatically build rich semantic descriptions of data sources
• Exploit the background knowledge from (i) the domain ontology, and (ii) the known source models
• Semantic descriptions are the key ingredients to automate many tasks, e.g.,
  – Source Discovery
  – Data Integration
  – Service Composition

Future Work

• Investigate how to create a more compact graph
  – Consolidate the overlapping segments of the known semantic models
• Relax the problem by removing the constraint that the correct semantic type of each attribute is known
  – CRF part returns a set of candidate semantic types along with their confidence values
• Use the data available in Linked Open Data (LOD) cloud to learn more accurate models
• Put the user in the loop
  – Integrate the new approach into Karma
  – The user refines one of the suggested models
  – The new model will be added to the graph as a new pattern