Using linked data to interpret tables

Varish Mulwad, Tim Finin, Zareen Syed and Anupam Joshi

University of Maryland, Baltimore County November 8, 2010

Interpreting a table

Name            Team          Position        Height
Michael Jordan  Chicago       Shooting guard  1.98
Allen Iverson   Philadelphia  Point guard     1.83
Yao Ming        Houston       Center          2.29
Tim Duncan      San Antonio   Power forward   2.11

Annotations on the table:

• Team column: http://dbpedia.org/class/yago/NationalBasketballAssociationTeams
• Allen Iverson cell: http://dbpedia.org/resource/Allen_Iverson
• Numeric cells: map numbers as values of properties
• Relation between the Name and Team columns: dbprop:team

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix yago: <http://dbpedia.org/class/yago/> .

"Name"@en is rdfs:label of dbpedia-owl:BasketballPlayer .
"Team"@en is rdfs:label of yago:NationalBasketballAssociationTeams .

"Michael Jordan"@en is rdfs:label of dbpedia:Michael_Jordan .
dbpedia:Michael_Jordan a dbpedia-owl:BasketballPlayer .

"Chicago Bulls"@en is rdfs:label of dbpedia:Chicago_Bulls .
dbpedia:Chicago_Bulls a yago:NationalBasketballAssociationTeams .

Use Cases

Intelligent querying over data

Create a ‘Semantic’ knowledge-base

Data Integration

Search / Query over tables

Confirm/Verify existing knowledge

Add new knowledge to the LOD cloud

Convert legacy data into Semantic Web formats

Motivation and Related Work

We are laying a strong foundation for the Semantic Web …

… but an old problem haunts us …

Chicken ? Egg ? … No Chicken ?

• ~14.1 billion HTML tables on the Web, 154 million of them with high-quality relational data (Cafarella et al. 2008)

• 305,632 datasets available as CSV or spreadsheets on Data.gov (US), plus 7 other nations establishing open data

• Where is the structured data?

Automate the process

• We need systems that can generate data from existing sources

• Not practical for humans to encode all this into RDF manually

Related Work

• Database to Ontology mapping (Barrasa, Corcho, & Gómez-Pérez 2004), (Hu & Qu 2007), (Papapanagiotou et al. 2006), and (Lawrence 2004)

• Mapping Relational databases to RDF [W3C working group – RDB2RDF]

• Mapping spreadsheets to RDF [RDF123, XLWrap]

• Practical and helpful systems but …
  – Require significant manual work
  – Do not generate linked data

• Interpreting web tables to answer complex search queries over the web tables (Limaye et al. 2010)

T2LD Framework

1. Predict Class for Columns
2. Linking the table cells
3. Identify and Discover relations

Predicting Class Labels for column

[Figure: the cells of the Team column (Chicago, Philadelphia, Houston, San Antonio) map to instances, the instances map to classes (Class 1 to Class 4), and one of these becomes the class for the column.]

Knowledge Base

Yago

Wikitology (1): a hybrid knowledge base where structured data meets unstructured data

(1) Wikitology was created as part of Zareen Syed’s Ph.D. dissertation

Querying the Knowledge-Base

Top-ranked results for each cell string:

Chicago: 1. Chicago Bulls 2. Chicago 3. Judy Chicago
Philadelphia: 1. Philadelphia 2. Philadelphia 76ers 3. Philadelphia (film)
Houston: 1. Houston Rockets 2. Houston 3. Allan Houston

Types

Chicago: {dbpedia-owl:Place, dbpedia-owl:City, yago:WomenArtist, yago:LivingPeople, yago:NationalBasketballAssociationTeams}
Philadelphia: {dbpedia-owl:Place, dbpedia-owl:PopulatedPlace, dbpedia-owl:Film, yago:NationalBasketballAssociationTeams, …}
Houston: {…}
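The candidate-generation step can be sketched in Python. This is only an illustrative sketch, not the authors' implementation: `KB_RESULTS` is a hypothetical stand-in for a live knowledge-base query and holds just the three Chicago results with the per-entity types given on the scoring slide, and the helper names `candidate_classes` and `candidate_pairs` are assumptions.

```python
# Hypothetical stand-in for a knowledge-base query: the ranked results for the
# cell string "Chicago", each with the types the slides list for that entity.
KB_RESULTS = {
    "Chicago": [
        ("Chicago Bulls", {"yago:NationalBasketballAssociationTeams"}),
        ("Chicago", {"dbpedia-owl:PopulatedPlace", "dbpedia-owl:City"}),
        ("Judy Chicago", {"yago:WomenArtist", "yago:LivingPeople"}),
    ],
}


def candidate_classes(cells, kb_results):
    """Union of the types of every entity returned for every cell string."""
    classes = set()
    for cell in cells:
        for _entity, types in kb_results.get(cell, []):
            classes |= types
    return classes


def candidate_pairs(cells, kb_results):
    """All (cell string, class) pairs that the scoring step has to process."""
    classes = sorted(candidate_classes(cells, kb_results))
    return [(cell, cls) for cell in cells for cls in classes]


for pair in candidate_pairs(["Chicago"], KB_RESULTS):
    print(pair)
```

With more cells in `KB_RESULTS`, every cell string is paired with every candidate class, which matches the [string, class] pairs listed for scoring.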

Scoring the classes

Possible classes for the column:
dbpedia-owl:Place, dbpedia-owl:City, yago:WomenArtist, yago:LivingPeople, yago:NationalBasketballAssociationTeams, dbpedia-owl:PopulatedPlace, dbpedia-owl:Film, …

Pairs to score:
[Chicago, dbpedia-owl:City], [Philadelphia, dbpedia-owl:City], [Houston, dbpedia-owl:City], …, [Chicago, dbpedia-owl:Film], [Philadelphia, dbpedia-owl:Film], …

E.g. processing the pair “Chicago, yago:NationalBasketballAssociationTeams”

Results for the string Chicago (R = rank, PR = PageRank):
(R = 1) Chicago Bulls {yago:NationalBasketballAssociationTeams} [PR = 6]
(R = 2) Chicago {dbpedia-owl:PopulatedPlace, dbpedia-owl:City} [PR = 5]
(R = 3) Judy Chicago {yago:WomenArtist, yago:LivingPeople} [PR = 4]

Score = w × (1 / R) + (1 – w) × (Normalized PageRank)
[Chicago, yago:NationalBasketballAssociationTeams] = (0.25 × 1 / 1) + (0.75 × 6 / 7) ≈ 0.893
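The scoring formula can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: w = 0.25 and the normalizing constant 7 are read off the worked example, `pair_score` and `score_classes` are hypothetical names, and keeping the best score when several results share a class is an assumption rather than something the slides specify.

```python
def pair_score(rank, page_rank, w=0.25, max_page_rank=7.0):
    """Score = w * (1 / R) + (1 - w) * (normalized PageRank)."""
    return w * (1.0 / rank) + (1.0 - w) * (page_rank / max_page_rank)


# (rank R, PageRank PR, types) for the string "Chicago", as on the slide.
CHICAGO_RESULTS = [
    (1, 6, {"yago:NationalBasketballAssociationTeams"}),
    (2, 5, {"dbpedia-owl:PopulatedPlace", "dbpedia-owl:City"}),
    (3, 4, {"yago:WomenArtist", "yago:LivingPeople"}),
]


def score_classes(results, w=0.25, max_page_rank=7.0):
    """Score every class that appears among the results for one cell string."""
    scores = {}
    for rank, page_rank, types in results:
        score = pair_score(rank, page_rank, w, max_page_rank)
        for cls in types:
            # Assumption: if several results share a class, keep the best score.
            scores[cls] = max(scores.get(cls, 0.0), score)
    return scores


scores = score_classes(CHICAGO_RESULTS)
print(round(scores["yago:NationalBasketballAssociationTeams"], 3))  # 0.893
```

How the per-cell scores are combined into one label for the whole column is not shown on this slide, so the sketch stops at the per-pair score.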

Linking the table cells

Machine Learning based Approach

1. Input: Table Cell + Column Header + Row Data + Column Type
2. Requery KB with predicted class labels as additional evidence
3. Generate a feature vector for the top N results of the query
4. Classifier ranks the entities within the set of possible results
5. Select the highest ranked entity
6. A second classifier decides whether to link or not
7. Link to the top ranked instance, or link to “NIL”

Learning to Rank

• We trained an SVMrank classifier which learnt to rank entities within a given set

Feature Vector

Similarity Measures
• Levenshtein distance
• Dice Score

Popularity Measures
• Wikitology Score
• PageRank
• Page Length
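The two similarity measures can be sketched in plain Python. This is a hedged illustration, not the authors' code: the slides do not say which Dice variant was used, so the character-bigram form below is an assumption, and the function names are hypothetical. The popularity measures come from the knowledge base and are not reproduced here.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def dice_score(a, b):
    """Dice coefficient over sets of lowercase character bigrams (assumed variant)."""
    def bigrams(s):
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}

    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Strings shorter than two characters have no bigrams.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2.0 * len(x & y) / (len(x) + len(y))


def similarity_features(cell_text, entity_label):
    """Similarity part of the feature vector for one candidate entity."""
    return [levenshtein(cell_text, entity_label), dice_score(cell_text, entity_label)]


print(similarity_features("Chicago", "Chicago Bulls"))
```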

“To Link or not to Link …”

• A second SVM classifier

• Feature vector included the feature vector of the top ranked entity and two additional features:
  – The SVMrank score of the top ranked entity
  – The difference in scores between the top two ranked entities
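The feature vector for the link decision can be assembled as below. This is a sketch under assumptions: `should_link` is a hypothetical callable standing in for the trained second SVM, the candidate feature values in the usage lines are made up for illustration, and treating the score difference as 0.0 when there is only one candidate is an assumption the slides do not cover.

```python
def link_features(ranked):
    """Build the input of the second classifier.

    `ranked` is a list of (entity, feature_vector, svm_rank_score) tuples,
    best candidate first.
    """
    if not ranked:
        raise ValueError("no candidate entities to link")
    _entity, top_features, top_score = ranked[0]
    # Assumption: with a single candidate there is no runner-up, so the
    # score difference is taken to be 0.0.
    gap = top_score - ranked[1][2] if len(ranked) > 1 else 0.0
    return list(top_features) + [top_score, gap]


def link_cell(ranked, should_link):
    """Return the top ranked entity, or "NIL" if the classifier says not to link."""
    if not ranked:
        return "NIL"
    return ranked[0][0] if should_link(link_features(ranked)) else "NIL"


# Illustrative candidates with made-up feature values and ranking scores.
candidates = [
    ("dbpedia:Chicago_Bulls", [6.0, 0.7], 2.0),
    ("dbpedia:Chicago", [0.0, 1.0], 0.5),
]
# Stand-in decision rule: link only when the top candidate wins by a clear margin.
print(link_cell(candidates, lambda features: features[-1] > 1.0))
```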

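Identify Relations

[Figure: the Name column (Michael Jordan, Allen Iverson, Yao Ming, Tim Duncan) is paired row by row with the Team column (Chicago, Philadelphia, Houston, San Antonio). The row pairs are labelled with the relation sets Rel ‘A’; Rel ‘A’, ‘C’; Rel ‘A’, ‘B’, ‘C’; and Rel ‘A’, ‘B’. Rel ‘A’ is shown as the relation between the two columns.]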

Relation between columns

Candidate relations for each row:

Michael Jordan - Chicago: dbprop:team
Allen Iverson - Philadelphia: dbprop:team, dbprop:draftTeam
Yao Ming - Houston: dbprop:team

Candidate relations: dbprop:team, dbprop:draftTeam

Scoring the relations

Each candidate scores one point for every row in which it holds:

dbprop:team Score: 3
dbprop:draftTeam Score: 1
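The relation scores amount to a vote count over rows, which can be sketched as follows. This is an illustration rather than the authors' code: `ROW_RELATIONS` hard-codes the per-row relation sets from the slide instead of querying the knowledge base, the function names are hypothetical, and which row carries dbprop:draftTeam does not affect the totals.

```python
from collections import Counter

# Relations found between the Name cell and the Team cell of each row,
# hard-coded from the slide instead of queried from the knowledge base.
ROW_RELATIONS = [
    {"dbprop:team"},
    {"dbprop:team", "dbprop:draftTeam"},
    {"dbprop:team"},
]


def score_relations(row_relations):
    """One vote per row for every relation that holds in that row."""
    votes = Counter()
    for relations in row_relations:
        votes.update(relations)
    return votes


def best_relation(row_relations):
    """Relation with the most votes, or None if no row has a relation."""
    votes = score_relations(row_relations)
    return votes.most_common(1)[0][0] if votes else None


print(score_relations(ROW_RELATIONS)["dbprop:team"])       # 3
print(score_relations(ROW_RELATIONS)["dbprop:draftTeam"])  # 1
print(best_relation(ROW_RELATIONS))                        # dbprop:team
```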

Annotating web tables for the Semantic Web

Table as linked RDF

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix yago: <http://dbpedia.org/class/yago/> .

"Name"@en is rdfs:label of dbpedia-owl:BasketballPlayer .
"Team"@en is rdfs:label of yago:NationalBasketballAssociationTeams .

"Michael Jordan"@en is rdfs:label of dbpedia:Michael_Jordan .
dbpedia:Michael_Jordan a dbpedia-owl:BasketballPlayer .

"Chicago Bulls"@en is rdfs:label of dbpedia:Chicago_Bulls .
dbpedia:Chicago_Bulls a yago:NationalBasketballAssociationTeams .

"Team"@en is rdfs:label of dbpedia-owl:Team .
“Team” is the common / human name for the class dbpedia-owl:Team

dbpedia:Chicago_Bulls a yago:NationalBasketballAssociationTeams .
dbpedia:Chicago_Bulls is of type (is an instance of) yago:NationalBasketballAssociationTeams
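The two statement patterns explained above can be emitted for any linked cell with a small helper. This is a minimal sketch using plain string formatting and no RDF library; `cell_to_n3` is a hypothetical name, and the caller is assumed to supply an already valid prefixed resource name such as dbpedia:Chicago_Bulls.

```python
def cell_to_n3(label, resource, rdf_class, lang="en"):
    """Return the N3 label and type statements for one linked table cell."""
    return [
        f'"{label}"@{lang} is rdfs:label of {resource} .',
        f"{resource} a {rdf_class} .",
    ]


for line in cell_to_n3("Chicago Bulls",
                       "dbpedia:Chicago_Bulls",
                       "yago:NationalBasketballAssociationTeams"):
    print(line)
```

The prefix declarations still have to be written once at the top of the output for the statements to parse.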

Results

Dataset summary

Number of tables           15
Total number of rows       199
Total number of columns    56 (52)*
Total number of entities   639 (611)*

* The number in brackets excludes columns that contained numbers

Evaluation for class label predictions

Evaluation # 1 (MAP)

• Compared the system’s ranked list of labels against a human ranked list of labels

• Metric - Mean Average Precision (MAP)

• Commonly used in the Information Retrieval domain to compare two ranked sets

MAP: 80.76 %

System ranked: 1. Person 2. Politician 3. President

Evaluator ranked: 1. President 2. Politician 3. OfficeHolder
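Average precision for the example lists can be computed as in the sketch below. The slides do not state how the two ranked lists were compared, so this uses one common convention as an assumption: the evaluator's labels are treated as an unordered relevant set and the precision sum is divided by the number of relevant labels. The function names are hypothetical.

```python
def average_precision(system_ranked, relevant_labels):
    """Average precision of a ranked list against a set of relevant labels."""
    relevant = set(relevant_labels)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, label in enumerate(system_ranked, start=1):
        if label in relevant:
            hits += 1
            precision_sum += hits / k  # precision at this cut-off
    return precision_sum / len(relevant)


def mean_average_precision(columns):
    """Mean of the average precision over (system list, evaluator list) pairs."""
    return sum(average_precision(s, e) for s, e in columns) / len(columns)


system = ["Person", "Politician", "President"]
evaluator = ["President", "Politician", "OfficeHolder"]
print(round(average_precision(system, evaluator), 3))  # 0.389
```

Other conventions, for example dividing by the number of relevant labels actually retrieved, give a different value for the same lists.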

Evaluation # 2 (Recall)

Recall > 0.6 (75 %)

System ranked: 1. Person 2. Politician 3. President

Evaluator ranked: 1. President 2. Politician 3. OfficeHolder
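Recall for the same example can be sketched as the share of evaluator labels that the system also proposed. This is an assumed reading of the metric, and `recall` and `share_above` are hypothetical helper names.

```python
def recall(system_labels, evaluator_labels):
    """Fraction of the evaluator's labels that also appear in the system's list."""
    expected = set(evaluator_labels)
    if not expected:
        return 0.0
    return len(expected & set(system_labels)) / len(expected)


def share_above(recalls, threshold=0.6):
    """Fraction of columns whose recall exceeds the threshold."""
    return sum(1 for r in recalls if r > threshold) / len(recalls)


system = ["Person", "Politician", "President"]
evaluator = ["President", "Politician", "OfficeHolder"]
print(round(recall(system, evaluator), 3))  # 0.667
```

For this column the system recovers two of the three evaluator labels, so the recall of about 0.667 clears the 0.6 threshold.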

Evaluation # 3 (Correctness)

• Evaluated whether our predicted class labels were “fair and correct”

• Class label may not be the most accurate one, but may be correct.
  – E.g. dbpedia-owl:PopulatedPlace is not the most accurate, but still a correct label for a column of cities

• Three human judges evaluated our predicted class labels

Evaluation # 3 (Correctness)

• A category-wise breakdown for class label correctness

Overall Accuracy: 76.92 %

Column: Nationality, Prediction: MilitaryConflict

Column: Birth Place, Prediction: PopulatedPlace

Evaluation for linking table cells to entities

Category-wise accuracy for linking table cells

Overall Accuracy: 66.12 %

Relation between columns

• Idea – Ask human evaluators to identify relations between columns in a given table

• Pilot Experiment – Asked three evaluators to annotate five random tables from our dataset

• Evaluators identified 20 relations

• Our accuracy: 5 out of 20 (25 %) were correct

Conclusion and Future Work

Conclusion

• We have demonstrated that it is possible to develop an automated framework for converting tables & spreadsheets to linked data

• Extending and adapting this framework for Open government data

• Discovery of new relations between entities

References

• Cafarella, M. J., Halevy, A., Wang, D. Z., Wu, E., Zhang, Y., 2008. WebTables: exploring the power of tables on the web. Proc. VLDB Endow. 1 (1), 538-549.

• Barrasa, J., Corcho, O., Gómez-Pérez, A., 2004. R2O, an extensible and semantically based database-to-ontology mapping language. In Proceedings of the 2nd Workshop on Semantic Web and Databases (SWDB2004). Vol. 3372. pp. 1069-1070.

• Hu, W., and Qu, Y. 2007. Discovering simple mappings between relational database schemas and ontologies. In Aberer, K.; Choi, K.-S.; Noy, N. F.; Allemang, D.; Lee, K.-I.; Nixon, L. J. B.; Golbeck, J.; Mika, P.; Maynard, D.; Mizoguchi, R.; Schreiber, G.; and Cudre-Mauroux, P., eds., ISWC/ASWC, volume 4825 of Lecture Notes in Computer Science, 225-238. Springer.

• Papapanagiotou, P.; Katsiouli, P.; Tsetsos, V.; Anagnostopoulos, C.; and Hadjiefthymiades, S. 2006. Ronto: Relational to ontology schema matching. In AIS SIGSEMIS Bulletin.

• Lawrence, E. D. R. 2004. Composing mappings between schemas using a reference ontology. In Proceedings of International Conference on Ontologies, Databases and Application of Semantics (ODBASE), 783-800. Springer.

• Han, L.; Finin, T.; Parr, C.; Sachs, J.; and Joshi, A. 2008. RDF123: from Spreadsheets to RDF. In Seventh International Semantic Web Conference. Springer.

• Han, L., Finin, T., Yesha, Y., 2009. Finding semantic web ontology terms from words. In: Proceedings of the Eighth International Semantic Web Conference. Springer.

• Limaye, G., Sarawagi, S., Chakrabarti, S., 2010. Annotating and searching web tables using entities, types and relationships. In: Proc. of the 36th Int'l Conference on Very Large Databases (VLDB).

This work was supported by:
