Semantic Web Research at University of Texas at Dallas (Schema Matching + Storage & Retrieval of RDF Graph)

Faculty: Latifur Khan, Bhavani Thuraisingham


Semantic Matching in the GIS Domain

Jeffrey Partyka (Ph.D. Student)

Faculty: Latifur Khan, Bhavani Thuraisingham

Funded by: [sponsor logo]

Schema Matching

• Measuring the semantic similarity between two tables by mapping the properties of their instances to one another:

roadName        City
Johnson Rd.     Plano
School Dr.      Richardson
Zeppelin St.    Lakehurst
Alma Dr.        Richardson
Preston Rd.     Addison
Dallas Pkwy     Dallas

Road                County
Custer Pwy          Cooke
15th St.            Collin
Parker Rd.          Collin
Alma Dr.            Collin
Campbell Rd.        Denton
Harry Hines Blvd.   Dallas

EBD similarity

Representing types using N-grams*

• Use commonly occurring N-grams in compared columns to determine similarity (N = 2)

Table A:
StrName           FENAME         Status
LOCUST-GROVE DR   LOCUST GROVE   BUILT
LOUISE LN         LOUISE         BUILT

Table B:
Street           Laddress   Raddress
TRAIL RANGE DR   1600       1798
CR45/MANET CT    2500       2598

N-gram types from A.StrName (C_A) = {LO, OC, CU, ST, …}

N-gram types from B.Street (C_B) = {TR, RA, R4, 5/, …}
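The N-gram types listed for the two columns can be reproduced with a short sketch. It assumes character-level N-grams taken over the raw cell value, spaces and punctuation included, which is consistent with types such as "R4" and "5/" coming from "CR45/MANET CT". The function names are illustrative and do not come from the slides.

```python
def ngrams(value, n=2):
    """Return the character N-grams of one cell value, in order of appearance."""
    return [value[i:i + n] for i in range(len(value) - n + 1)]


def column_ngram_types(column, n=2):
    """Return the set of distinct N-gram types seen anywhere in a column."""
    types = set()
    for value in column:
        types.update(ngrams(value, n))
    return types


str_name = ["LOCUST-GROVE DR", "LOUISE LN"]     # A.StrName
street = ["TRAIL RANGE DR", "CR45/MANET CT"]    # B.Street

types_a = column_ngram_types(str_name)
types_b = column_ngram_types(street)
```

With N = 2, `types_a` contains LO, OC, CU and ST, and `types_b` contains TR, RA, R4 and 5/, matching the sets shown for the two columns.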

1. Jeffrey Partyka, Neda Alipanah, Nilesh Singhania, Latifur Khan, Bhavani Thuraisingham, “Content Based Ontology Matching for GIS Datasets”, ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM GIS 2008), pp. 407-410, Irvine, California, USA, November 2008.

*Jeffrey Partyka, Neda Alipanah, Latifur Khan, Bhavani Thuraisingham & Shashi Shekhar, “Content Based Ontology Matching for GIS Datasets”, ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM GIS 2008), pp. 407-410, Irvine, California, USA, November 2008.

How do we measure N-gram similarity between columns?

• Entropy-Based Distribution (EBD)

• EBD is a measurement of type similarity between 2 columns:

EBD = H(C|T) / H(C), where C = C1 ∪ C2

• EBD takes values in the range [0, 1]. Greater EBD corresponds to more similar type distributions between compared columns.

Entropy and Conditional Entropy

Entropy: a measure of the uncertainty associated with a random variable X:

H(X) = -Σ_x p(x) log p(x)

Conditional Entropy: measures the remaining entropy of a random variable Y given the value of a second random variable X:

H(Y|X) = -Σ_x Σ_y p(x, y) log p(y|x)

Visualizing Entropy and Conditional Entropy

H(C) = -Σ p_i log p_i, for all x ∈ C1 ∪ C2

H(C|T) = H(C,T) - H(T), for all x ∈ C1 ∪ C2 and t ∈ T
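The EBD ratio H(C|T) / H(C) can be sketched directly from these entropies. The sketch below rests on several assumptions that the slides do not state: probabilities are estimated from raw 2-gram occurrence counts, logarithms are base 2, C is the column label (C1 or C2) and T is the 2-gram type. It uses the chain rule H(C|T) = H(C,T) - H(T). All function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(value, n=2):
    """Character N-grams of one cell value."""
    return [value[i:i + n] for i in range(len(value) - n + 1)]


def entropy(counts):
    """Shannon entropy (base 2) of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def ebd(column1, column2, n=2):
    """EBD = H(C|T) / H(C), estimated from N-gram occurrence counts."""
    joint = Counter()                      # counts of (column label, N-gram type)
    for label, column in (("C1", column1), ("C2", column2)):
        for value in column:
            for gram in ngrams(value, n):
                joint[(label, gram)] += 1
    by_column, by_type = Counter(), Counter()
    for (label, gram), count in joint.items():
        by_column[label] += count
        by_type[gram] += count
    h_c = entropy(by_column.values())
    if h_c == 0:                           # one column contributed no N-grams
        return 0.0
    h_c_given_t = entropy(joint.values()) - entropy(by_type.values())
    return h_c_given_t / h_c
```

Under these assumptions, two columns with identical contents give an EBD of 1, and two columns that share no 2-gram types give an EBD of 0, which matches the stated [0, 1] range.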

Faults of this Method

• Semantically similar columns are not guaranteed to have a high similarity score

A ∈ O1:
City          Country
Dallas        USA
Houston       USA
Kingston      Jamaica
Halifax       Canada
Mexico City   Mexico

B ∈ O2:
ctyName        country
Shanghai       China
Beijing        China
Tokyo          Japan
New Delhi      India
Kuala Lumpur   Malaysia

2-grams extracted from A: {Da, al, la, as, Ho, ou, us, …}

2-grams extracted from B: {Sh, ha, an, ng, gh, ha, ai, Be, ei, ij, …}

Introducing Google Distance
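For reference, the Normalized Google Distance (NGD) of Cilibrasi and Vitányi is commonly defined from search hit counts as NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f(x) is the number of pages containing x, f(x, y) the number containing both terms, and N the size of the index. The sketch below only evaluates that formula. The hit counts are supplied by the caller and are hypothetical here, since no search API is specified in the slides.

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from hit counts fx, fy, fxy and index size n."""
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return float("inf")   # terms that never co-occur are maximally distant
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Hypothetical counts: two terms that always appear together, then rarely together.
always_together = ngd(1000, 1000, 1000, 10**6)
rarely_together = ngd(1000, 1000, 10, 10**6)
```

A value near 0 means the two terms almost always co-occur, and larger values mean they are less related.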

* Jeffrey Partyka, Neda Alipanah, Latifur Khan, Bhavani M. Thuraisingham, Shashi Shekhar, “Ontology Alignment Using Multiple Contexts”, International Semantic Web Conference (ISWC) (Posters & Demos), Karlsruhe, Germany, October, 2008.

K-medoid + NGD instance similarity

C1 ∈ O1:
roadName       City
Johnson Rd.    Plano
School Dr.     Richardson
Zeppelin St.   Lakehurst

C2 ∈ O2:
Road         County
Custer Pwy   Collin
15th St.     Collin
Parker Rd.   Collin

Step 1: Extract distinct keywords from compared columns

Keywords extracted from columns (C1 ∪ C2) = {Johnson, Rd., School, 15th, …}

Step 2: Group distinct keywords together into semantic clusters

{“Rd.”, “Dr.”, “St.”, “Pwy”, …}, {“Johnson”, “School”, “Dr.”, …}

Step 3: Calculate similarity

Similarity = H(C|T) / H(C)
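Step 2 can be sketched with a small k-medoid loop over any pairwise distance. This is a minimal illustration rather than the authors' implementation. It uses a simple alternate-and-reassign variant of k-medoids with a deterministic start, and it substitutes a hand-made `toy_distance` for NGD so that it runs without search hit counts. Both the distance values and the function names are assumptions.

```python
def k_medoids(items, dist, k, max_iter=100):
    """Alternate between assigning items to the nearest medoid and re-picking medoids."""
    medoids = list(items[:k])          # deterministic start: the first k items
    clusters = {}
    for _ in range(max_iter):
        clusters = {m: [] for m in medoids}
        for item in items:
            nearest = min(medoids, key=lambda m: dist(item, m))
            clusters[nearest].append(item)
        # New medoid of a cluster: the member with the smallest total distance to the rest.
        new_medoids = [
            min(members, key=lambda c: sum(dist(c, other) for other in members))
            for members in clusters.values() if members
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return [members for members in clusters.values() if members]


SUFFIXES = {"Rd.", "Dr.", "St."}       # stand-in for "semantically close" keywords


def toy_distance(a, b):
    """Hypothetical stand-in for NGD: small within a group, large across groups."""
    if a == b:
        return 0.0
    return 0.1 if (a in SUFFIXES) == (b in SUFFIXES) else 0.9


keywords = ["Rd.", "Johnson", "Dr.", "Parker", "St.", "Custer"]
clusters = k_medoids(keywords, toy_distance, k=2)
```

With this toy distance the street-type keywords and the street-name keywords end up in separate clusters. The cluster labels would then play the role of T when computing Similarity = H(C|T) / H(C) in Step 3.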

Problems with K-medoid + NGD*

Two different geographic entities in the same location (e.g., Dallas, TX and Dallas County) can have a very low computed NGD value and thus be mistaken for being similar:

roadName        City
Johnson Rd.     Plano
School Dr.      Richardson
Zeppelin St.    Lakehurst
Alma Dr.        Richardson
Preston Rd.     Addison
Dallas Pkwy     Dallas

Road                County
Custer Pwy          Cooke
15th St.            Collin
Parker Rd.          Collin
Alma Dr.            Collin
Campbell Rd.        Denton
Harry Hines Blvd.   Dallas

*Jeffrey Partyka, Latifur Khan, Bhavani Thuraisingham, “Semantic Schema Matching Without Shared Instances,” to appear in Third IEEE International Conference on Semantic Computing, Berkeley, CA, USA - September 14-16, 2009.

Using geographic type information*

We use a gazetteer to determine the geographic type of an instance:

[Diagram labels: O1, O2, Geotypes]

*Jeffrey Partyka, Latifur Khan, Bhavani Thuraisingham, “Geographically-Typed Semantic Schema Matching,” submitted to ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM GIS 2009), Seattle, Washington, USA, November 2009.

Disambiguating Geographic Types For A Given Instance

We can use metadata and other information to reduce the number of type possibilities for a given instance:

Column: City
Instances: Plano, Richardson, Dallas, …

Possible types for “Dallas”: City, County
Resolved type: Dallas City
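The lookup and disambiguation described above can be sketched as follows. The gazetteer here is a tiny hand-written dictionary, not a real gazetteer service, and the only metadata used is the column name. Both are assumptions made for illustration, as are the function names.

```python
# Hypothetical gazetteer: instance name -> set of possible geographic types.
GAZETTEER = {
    "Plano": {"City"},
    "Richardson": {"City"},
    "Dallas": {"City", "County"},
    "Collin": {"County"},
}


def candidate_geotypes(instance):
    """All geographic types the gazetteer lists for an instance."""
    return set(GAZETTEER.get(instance, set()))


def disambiguate(instance, column_name):
    """Keep only the candidate types that agree with the column name, if any do."""
    candidates = candidate_geotypes(instance)
    hinted = {t for t in candidates if t.lower() == column_name.lower()}
    return hinted or candidates        # fall back to every candidate when no hint applies


in_city_column = disambiguate("Dallas", "City")
in_county_column = disambiguate("Dallas", "County")
```

In this sketch "Dallas" resolves to City when it appears in a column named City and to County when it appears in a column named County. When the column name gives no hint, every candidate type is kept.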

Geographic Types + NGD

It is now possible to correct the geographic co-occurrence mistakes of NGD:

roadName        City
Johnson Rd.     Plano
School Dr.      Richardson
Zeppelin St.    Lakehurst
Alma Dr.        Richardson
Preston Rd.     Addison
Dallas Pkwy     Dallas

Road                County
Custer Pwy          Cooke
15th St.            Collin
Parker Rd.          Collin
Alma Dr.            Collin
Campbell Rd.        Denton
Harry Hines Blvd.   Dallas

Disambiguation Using latlong values

• Each input consists of a name and coordinates (Lat/Long values).

• Our knowledge base consists of records for a number of different geospatial features such as streets, lakes, schools, etc. for the entire US.

• Each entry in the knowledge base contains coordinates and other spatial information, such as the length and area of the landmark.

Disambiguation Using latlong values (contd..)

Geo-Database

Disambiguation Using latlong values (contd..)

• We first look for entries with a similar name in the knowledge base.

• Next, for each feature type in the knowledge base, we choose the entry that is located closest to the input.

• If two features are both close to the input, we disambiguate the feature type on the basis of geospatial properties such as area and perimeter.
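The first two steps above can be sketched with a great-circle distance and a per-type nearest-entry search. Everything concrete here is an assumption: the knowledge-base records and their coordinates are made up, "similar name" is reduced to a case-insensitive substring test, and `tie_km` is a hypothetical threshold. The third step is only flagged, not resolved, because the slides do not say how area and perimeter are compared.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi, dlmb = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def closest_per_type(name, lat, lon, knowledge_base):
    """For each feature type, keep the similarly named entry nearest to the input."""
    best = {}
    for entry in knowledge_base:
        if name.lower() not in entry["name"].lower():
            continue
        d = haversine_km(lat, lon, entry["lat"], entry["lon"])
        if entry["type"] not in best or d < best[entry["type"]][0]:
            best[entry["type"]] = (d, entry)
    return sorted(best.values(), key=lambda pair: pair[0])


def resolve_type(name, lat, lon, knowledge_base, tie_km=1.0):
    """Return (feature type, needs_tie_break) for an input name and coordinates."""
    ranked = closest_per_type(name, lat, lon, knowledge_base)
    if not ranked:
        return None, False
    ambiguous = len(ranked) > 1 and ranked[1][0] - ranked[0][0] < tie_km
    return ranked[0][1]["type"], ambiguous


# Made-up records for illustration only.
KB = [
    {"name": "Dallas", "type": "city", "lat": 32.78, "lon": -96.80},
    {"name": "Dallas County", "type": "county", "lat": 32.77, "lon": -96.78},
    {"name": "Dallas", "type": "city", "lat": 44.92, "lon": -123.32},
]
result = resolve_type("Dallas", 32.78, -96.80, KB)
```

When `needs_tie_break` is true, the caller would go on to compare geospatial properties such as area and perimeter, as the last bullet describes.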

Attribute Weighting

• The default weighting scheme treats all 1-1 matches between properties/attributes with equal importance:

roadName        City
Johnson Rd.     Plano
School Dr.      Richardson
Zeppelin St.    Lakehurst
Alma Dr.        Richardson
Preston Rd.     Addison
Dallas Pkwy     Dallas

Road                County
Custer Pwy          Cooke
15th St.            Collin
Parker Rd.          Collin
Alma Dr.            Collin
Campbell Rd.        Denton
Harry Hines Blvd.   Dallas

Weights: 50% (roadName, Road), 50% (City, County)
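One way to read the default scheme is as a weighted sum of per-attribute similarities in which every 1-1 match gets the same weight. The slides do not state how the scores are combined, so the weighted sum, the similarity values and the function name below are all assumptions.

```python
def table_similarity(attribute_sims, weights=None):
    """Combine per-match similarities; with no weights given, every match counts equally."""
    if weights is None:
        weights = {pair: 1.0 / len(attribute_sims) for pair in attribute_sims}
    return sum(weights[pair] * sim for pair, sim in attribute_sims.items())


# Hypothetical similarity scores for the two 1-1 matches.
sims = {("roadName", "Road"): 0.8, ("City", "County"): 0.4}

equal = table_similarity(sims)                                   # 50% / 50%
skewed = table_similarity(sims, {("roadName", "Road"): 0.75,
                                 ("City", "County"): 0.25})      # non-default weights
```

Passing an explicit `weights` mapping shows how a non-default scheme would shift the overall score toward the more trusted attribute match.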

Results of Geographic Matching Over 2 Separate Road Network Data Sources