46
Using linked data to interpret tables Varish Mulwad, Tim Finin, Zareen Syed and Anupam Joshi University of Maryland, Baltimore County November 8, 2010 1

Using linked data to interpret tables

  • Upload
    cadee

  • View
    71

  • Download
    0

Embed Size (px)

DESCRIPTION

Using linked data to interpret tables. Varish Mulwad , Tim Finin , Zareen Syed and Anupam Joshi University of Maryland, Baltimore County November 8, 2010. Interpreting a table. http://dbpedia.org/class/yago/NationalBasketballAssociationTeams. dbprop:team. - PowerPoint PPT Presentation

Citation preview

Page 1: Using  linked data  to interpret  tables

Using linked data to interpret tables

Varish Mulwad, Tim Finin, Zareen Syed and Anupam Joshi

University of Maryland, Baltimore County November 8, 2010

1

Page 2: Using  linked data  to interpret  tables

Interpreting a table

Name Team Position Height

Michael Jordan Chicago Shooting guard 1.98

Allen Iverson Philadelphia Point guard 1.83

Yao Ming Houston Center 2.29

Tim Duncan San Antonio Power forward 2.11

http://dbpedia.org/class/yago/NationalBasketballAssociationTeams

http://dbpedia.org/class/yago/NationalBasketballAssociationTeams

http://dbpedia.org/resource/Allen_Iversonhttp://dbpedia.org/resource/Allen_Iverson Map numbers as values of properties

Map numbers as values of properties

dbprop:teamdbprop:team

Page 3: Using  linked data  to interpret  tables

Interpreting a table

Name Team Position Height

Michael Jordan Chicago Shooting guard 1.98

Allen Iverson Philadelphia Point guard 1.83

Yao Ming Houston Center 2.29

Tim Duncan San Antonio Power forward 2.11

@prefix dbpedia: <http://dbpedia.org/resource/> .@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .@prefix yago: <http://dbpedia.org/class/yago/> .

"Name"@en is rdfs:label of dbpedia-owl:BasketballPlayer ."Team"@en is rdfs:label of yago:NationalBasketballAssociationTeams .

"Michael Jordan"@en is rdfs:label of dbpedia:Michael Jordan .dbpedia:Michael Jordan a dbpedia-owl:BasketballPlayer .

"Chicago Bulls"@en is rdfs:label of dbpedia:Chicago Bulls .dbpedia:Chicago Bulls a yago:NationalBasketballAssociationTeams .

@prefix dbpedia: <http://dbpedia.org/resource/> .@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .@prefix yago: <http://dbpedia.org/class/yago/> .

"Name"@en is rdfs:label of dbpedia-owl:BasketballPlayer ."Team"@en is rdfs:label of yago:NationalBasketballAssociationTeams .

"Michael Jordan"@en is rdfs:label of dbpedia:Michael Jordan .dbpedia:Michael Jordan a dbpedia-owl:BasketballPlayer .

"Chicago Bulls"@en is rdfs:label of dbpedia:Chicago Bulls .dbpedia:Chicago Bulls a yago:NationalBasketballAssociationTeams .

Page 4: Using  linked data  to interpret  tables

Use Cases

Name Team Position Height

Michael Jordan Chicago Shooting guard 1.98

Allen Iverson Philadelphia Point guard 1.83

Yao Ming Houston Center 2.29

Tim Duncan San Antonio Power forward 2.11

Name Team Position Height

Michael Jordan Chicago Shooting guard 1.98

Allen Iverson Philadelphia Point guard 1.83

Yao Ming Houston Center 2.29

Tim Duncan San Antonio Power forward 2.11

Name Team Position Height

Michael Jordan Chicago Shooting guard 1.98

Allen Iverson Philadelphia Point guard 1.83

Yao Ming Houston Center 2.29

Tim Duncan San Antonio Power forward 2.11

Name Team Position Height

Michael Jordan Chicago Shooting guard 1.98

Allen Iverson Philadelphia Point guard 1.83

Yao Ming Houston Center 2.29

Tim Duncan San Antonio Power forward 2.11

Name Team Position Height

Michael Jordan Chicago Shooting guard 1.98

Allen Iverson Philadelphia Point guard 1.83

Yao Ming Houston Center 2.29

Tim Duncan San Antonio Power forward 2.11

Intelligent querying over data

Create a ‘Semantic’ knowledge-base

Page 5: Using  linked data  to interpret  tables

Use CasesName Team Position Height

Michael Jordan Chicago Shooting guard 1.98

Allen Iverson Philadelphia Point guard 1.83

Yao Ming Houston Center 2.29

Tim Duncan San Antonio Power forward 2.11

@prefix dbpedia: <http://dbpedia.org/resource/> .@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .@prefix yago: <http://dbpedia.org/class/yago/> .

"Name"@en is rdfs:label of dbpedia-owl:BasketballPlayer ."Team"@en is rdfs:label of yago:NationalBasketballAssociationTeams .

"Michael Jordan"@en is rdfs:label of dbpedia:Michael Jordan .dbpedia:Michael Jordan a dbpedia-owl:BasketballPlayer .

"Chicago Bulls"@en is rdfs:label of dbpedia:Chicago Bulls .dbpedia:Chicago Bulls a yago:NationalBasketballAssociationTeams .

@prefix dbpedia: <http://dbpedia.org/resource/> .@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .@prefix yago: <http://dbpedia.org/class/yago/> .

"Name"@en is rdfs:label of dbpedia-owl:BasketballPlayer ."Team"@en is rdfs:label of yago:NationalBasketballAssociationTeams .

"Michael Jordan"@en is rdfs:label of dbpedia:Michael Jordan .dbpedia:Michael Jordan a dbpedia-owl:BasketballPlayer .

"Chicago Bulls"@en is rdfs:label of dbpedia:Chicago Bulls .dbpedia:Chicago Bulls a yago:NationalBasketballAssociationTeams .

Data Integration

Search / Query over tables

Name Team Position Height

Michael Jordan Chicago Shooting guard 1.98

Allen Iverson Philadelphia Point guard 1.83

Yao Ming Houston Center 2.29

Tim Duncan San Antonio Power forward 2.11

Confirm/Verify existing knowledgeAdd new knowledge to the LOD cloud

Convert legacy data into Semantic Web formats

Page 6: Using  linked data  to interpret  tables

Motivation and Related Work

Page 7: Using  linked data  to interpret  tables

We are laying a strong foundation for the Semantic Web …

… but an old problem haunts us …

Page 8: Using  linked data  to interpret  tables

Chicken ? Egg ? … No Chicken ?

• ~ 14.1 billion tables, 154 million with high quality relational data (Cafarella et al. 2008)

• 305,632 Datasets available as CSV or spreadsheets on Data.gov (US) + 7 Other nations establishing open data

• Where is structured data ?

Page 9: Using  linked data  to interpret  tables

Automate the process

• We need systems that can generate data from existing sources

• Not practical for humans to encode all this into RDF manually

Page 10: Using  linked data  to interpret  tables

Related Work

• Database to Ontology mapping (Barrasa, scar Corcho, & Gmez-prez 2004), (Hu & Qu 2007), (Papapanagiotou et al. 2006), and (Lawrence 2004)

• Mapping Relational databases to RDF [W3C working group – RDB2RDF]

Page 11: Using  linked data  to interpret  tables

Related Work

• Mapping spreadsheets to RDF [RDF123, XLWrap]

• Practical and helpful systems but … – Require significant manual work– Do not generate linked data

• Interpreting web tables to answer complex search queries over the web tables (Limaye et al. 2010)

Page 12: Using  linked data  to interpret  tables

T2LD Framework

Predict Class for Columns

Predict Class for Columns

Linking the table cells

Linking the table cells

Identify and Discover relations

Identify and Discover relations

T2LD Framework

Page 13: Using  linked data  to interpret  tables

T2LD Framework

Predict Class for Columns

Predict Class for Columns

Linking the table cells

Linking the table cells

Identify and Discover relations

Identify and Discover relations

Page 14: Using  linked data  to interpret  tables

Predicting Class Labels for column

Team

Chicago

Philadelphia

Houston

San Antonio

Class

Instance

Class for the column

Class 1

Class 2

Class 3

Class 4

Page 15: Using  linked data  to interpret  tables

Knowledge Base

Yago

Wikitology1 – A hybrid knowledge base where structured data meets unstructured data

1 – Wikitology was created as part of Zareen Syed’s Ph.D. dissertation

Page 16: Using  linked data  to interpret  tables

Querying the Knowledge–Base

1. Chicago Bulls2. Chicago3. Judy Chicago

1. Chicago Bulls2. Chicago3. Judy Chicago

1. Philadelphia2. Philadelphia 76ers3. Philadelphia (film)

1. Philadelphia2. Philadelphia 76ers3. Philadelphia (film)

1. Houston Rockets2. Houston3. Allan Houston

1. Houston Rockets2. Houston3. Allan Houston

{dbpedia-owl:Place,dbpedia-owl:City,yago:WomenArtist,yago:LivingPeople,yago:NationalBasketballAssociationTeams }

Types

{dbpedia-owl:Place, dbpedia-owl:PopulatedPlace, dbpedia-owl:Film,yago:NationalBasketballAssociationTeams …. ….. ….. }

{……………………………………………………………. }

Team

Chicago

Philadelphia

Houston

San Antonio

Page 17: Using  linked data  to interpret  tables

Scoring the classesPossible Classes for the column - dbpedia-owl:Placedbpedia-owl:Cityyago:WomenArtistyago:LivingPeopleyago:NationalBasketballAssociationTeamsdbpedia-owl:PopulatedPlacedbpedia-owl:Film………

Possible Classes for the column - dbpedia-owl:Placedbpedia-owl:Cityyago:WomenArtistyago:LivingPeopleyago:NationalBasketballAssociationTeamsdbpedia-owl:PopulatedPlacedbpedia-owl:Film………

[Chicago, dbpedia-owl:City][Philadelphia, dbpedia-owl:City][Houston, dbpedia-owl:City] ….….[Chicago,dbpedia-owl:Film][Philadelphia,dbpedia-owl:Film]………

[Chicago, dbpedia-owl:City][Philadelphia, dbpedia-owl:City][Houston, dbpedia-owl:City] ….….[Chicago,dbpedia-owl:Film][Philadelphia,dbpedia-owl:Film]………

E.g. Processing class – “Chicago,yago:NationalBasketballAssociationTeams”

String Chicago: (R = 1) Chicago Bulls {yago:NationalBasketballAssociationTeams} [PR = 6](R = 2) Chicago {dbpedia-owl:PopulatedPlace, dbpedia-owl:City} [PR = 5](R = 3) Judy Chicago {yago:WomenArtist,yago:LivingPeople} [PR = 4]

Score = w x ( 1 / R ) + (1 – w) x (Normalized Page Rank)[Chicago, yago:NationalBasketballAssociationTeams] = (0.25 x 1 / 1 ) + (0.75 x 6 / 7) = 0.892

E.g. Processing class – “Chicago,yago:NationalBasketballAssociationTeams”

String Chicago: (R = 1) Chicago Bulls {yago:NationalBasketballAssociationTeams} [PR = 6](R = 2) Chicago {dbpedia-owl:PopulatedPlace, dbpedia-owl:City} [PR = 5](R = 3) Judy Chicago {yago:WomenArtist,yago:LivingPeople} [PR = 4]

Score = w x ( 1 / R ) + (1 – w) x (Normalized Page Rank)[Chicago, yago:NationalBasketballAssociationTeams] = (0.25 x 1 / 1 ) + (0.75 x 6 / 7) = 0.892

Page 18: Using  linked data  to interpret  tables

T2LD Framework

Predict Class for Columns

Predict Class for Columns

Linking the table cells

Linking the table cells

Identify and Discover relations

Identify and Discover relations

Page 19: Using  linked data  to interpret  tables

Machine Learning based Approach

Table Cell + Column Header + Row Data

+ Column Type

Table Cell + Column Header + Row Data

+ Column Type

Requery KB with predicted class labels as additional evidence

Requery KB with predicted class labels as additional evidence

Generate a feature vector for the top N results of the query

Generate a feature vector for the top N results of the query

Classifier ranks the entities within the set

of possible results

Classifier ranks the entities within the set

of possible results

Select the highest ranked entity

Select the highest ranked entity

A second classifier decides whether to

link or not

A second classifier decides whether to

link or not

Link to “NIL”Link to “NIL”Link to the top

ranked instanceLink to the top

ranked instance

Page 20: Using  linked data  to interpret  tables

Learning to Rank

• We trained a SVMrank classifier which learnt to rank entities within a given set

Feature VectorFeature Vector

Similarity MeasuresSimilarity Measures

Popularity MeasuresPopularity Measures

• Levenshtein distance• Dice Score• Levenshtein distance• Dice Score

• Wikitology Score• PageRank• Page Length

• Wikitology Score• PageRank• Page Length

Page 21: Using  linked data  to interpret  tables

“To Link or not to Link … ’’

• A second SVM classifier

• Feature vector included the feature vector of the top ranked entity and additional two features –

– The SVMrank score of the top ranked entity– The difference in scores between the top two

ranked entities

Page 22: Using  linked data  to interpret  tables

T2LD Framework

Predict Class for Columns

Predict Class for Columns

Linking the table cells

Linking the table cells

Identify and Discover relations

Identify and Discover relations

Page 23: Using  linked data  to interpret  tables

Identify Relations

Name

Michael Jordan

Allen Iverson

Yao Ming

Tim Duncan

Team

Chicago

Philadelphia

Houston

San Antonio

Rel ‘A’Rel ‘A’

Rel ‘A’

Rel ‘A’, ‘C’

Rel ‘A’, ‘B’, ‘C’

Rel ‘A’, ‘B’

Page 24: Using  linked data  to interpret  tables

Relation between columns

Michael Jordan - Chicago

Allen Iverson - Philadelphia

Yao Ming - Houston

Michael Jordan - Chicago

Allen Iverson - Philadelphia

Yao Ming - Houston

dbprop:teamdbprop:team

dbprop:teamdbprop:draftTeam

dbprop:teamdbprop:draftTeam

dbprop:teamdbprop:team

dbprop:team dbprop:draftTeam

dbprop:team dbprop:draftTeam

Candidate relationsCandidate relations

Page 25: Using  linked data  to interpret  tables

Scoring the relations

Michael Jordan - Chicago

Allen Iverson – Philadelphia

Yao Ming - Houston

Michael Jordan - Chicago

Allen Iverson – Philadelphia

Yao Ming - Houston

dbprop:teamdbprop:team

dbprop:team dbprop:draftTeam

dbprop:team dbprop:draftTeam

dbprop:teamdbprop:team

Candidates: dbprop:team

dbprop:draftTeam

Candidates: dbprop:team

dbprop:draftTeam

dbprop:draftTeamScore: 0dbprop:draftTeamScore: 0

dbprop:draftTeam

Score:1

dbprop:draftTeam

Score:1

dbprop:teamScore:3dbprop:teamScore:3

Page 26: Using  linked data  to interpret  tables

T2LD Framework

Predict Class for Columns

Predict Class for Columns

Linking the table cells

Linking the table cells

Identify and Discover relations

Identify and Discover relations

Page 27: Using  linked data  to interpret  tables

Annotating web tables for the Semantic Web

Page 28: Using  linked data  to interpret  tables

Table as linked RDF

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .@prefix dbpedia: <http://dbpedia.org/resource/> .@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .@prefix yago: <http://dbpedia.org/class/yago/> .

"Name"@en is rdfs:label of dbpedia-owl:BasketballPlayer ."Team"@en is rdfs:label of yago:NationalBasketballAssociationTeams .

"Michael Jordan"@en is rdfs:label of dbpedia:Michael Jordan .dbpedia:Michael Jordan a dbpedia-owl:BasketballPlayer .

"Chicago Bulls"@en is rdfs:label of dbpedia:Chicago Bulls .dbpedia:Chicago Bulls a yago:NationalBasketballAssociationTeams .

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .@prefix dbpedia: <http://dbpedia.org/resource/> .@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .@prefix yago: <http://dbpedia.org/class/yago/> .

"Name"@en is rdfs:label of dbpedia-owl:BasketballPlayer ."Team"@en is rdfs:label of yago:NationalBasketballAssociationTeams .

"Michael Jordan"@en is rdfs:label of dbpedia:Michael Jordan .dbpedia:Michael Jordan a dbpedia-owl:BasketballPlayer .

"Chicago Bulls"@en is rdfs:label of dbpedia:Chicago Bulls .dbpedia:Chicago Bulls a yago:NationalBasketballAssociationTeams .

“Team”@en is rdfs:label of dbpedia-owl:Team .“Team” is the common / human name for the class dbpedia-owl:Team

“Team”@en is rdfs:label of dbpedia-owl:Team .“Team” is the common / human name for the class dbpedia-owl:Team

dbpedia:Chicago_Bulls a yago:NationalBasketballAssociationTeams .dbpedia:Chicago_Bulls is a type (instance) yago:NationalBasketballAssociationTeams

dbpedia:Chicago_Bulls a yago:NationalBasketballAssociationTeams .dbpedia:Chicago_Bulls is a type (instance) yago:NationalBasketballAssociationTeams

Page 29: Using  linked data  to interpret  tables

Results

Page 30: Using  linked data  to interpret  tables

Dataset summary

Number of Tables 15

Total Number of rows 199

Total Number of columns 56 (52)

Total Number of entities 639 (611)

* The number in the brackets indicates # excluding columns that contained numbers

Page 31: Using  linked data  to interpret  tables

Dataset summary

Page 32: Using  linked data  to interpret  tables

Dataset summary

Page 33: Using  linked data  to interpret  tables

Evaluation for class label predictions

Page 34: Using  linked data  to interpret  tables

Evaluation # 1 (MAP)

• Compared the system’s ranked list of labels against a human ranked list of labels

• Metric - Mean Average Precision (MAP)

• Commonly used in the Information Retrieval domain to compare two ranked sets

Page 35: Using  linked data  to interpret  tables

Evaluation # 1 (MAP)

80.76 %

System Ranked:1. Person2. Politician3. President

Evaluator Ranked:1. President2. Politician3. OfficeHolder

Page 36: Using  linked data  to interpret  tables

Evaluation # 2 (Recall)

Recall > 0.6 (75 %)

System Ranked:1. Person2. Politician3. President

Evaluator Ranked:1. President2. Politician3. OfficeHolder

Page 37: Using  linked data  to interpret  tables

Evaluation # 3 (Correctness)

• Evaluated whether our predicted class labels were “fair and correct”

• Class label may not be the most accurate one, but may be correct. – E.g. dbpedia-owl:PopulatedPlace is not the most accurate, but still

a correct label for column of cities

• Three human judges evaluated our predicted class labels

Page 38: Using  linked data  to interpret  tables

Evaluation # 3 (Correctness)

• A category-wise breakdown for class label correctnessOverall

Accuracy: 76.92 %

Column – NationalityPrediction – MilitaryConflict

Column – Birth PlacePrediction – PopulatedPlace

Page 39: Using  linked data  to interpret  tables

Evaluation for linking table cells to entities

Page 40: Using  linked data  to interpret  tables

Category-wise accuracy for linking table cells

Overall Accuracy: 66.12 %

Page 41: Using  linked data  to interpret  tables

Relation between columns

• Idea – Ask human evaluators to identify relations between columns in a given table

• Pilot Experiment – Asked three evaluators to annotate five random tables from our dataset

• Evaluators identified 20 relations

• Our accuracy – 5 out of 20 (25 % ) were correct

Page 42: Using  linked data  to interpret  tables

Conclusion and Future Work

Page 43: Using  linked data  to interpret  tables

Conclusion

• We have demonstrated that it is possible to develop a automated framework for converting tables & spreadsheets to linked data

• Extending and adapting this framework for Open government data

• Discovery of new relations between entities

Page 44: Using  linked data  to interpret  tables

References• Cafarella, M. J., Halevy, A., Wang, D. Z., Wu, E., Zhang, Y., 2008.

Webtables:exploring the power of tables on the web. Proc. VLDB Endow.1 (1), 538-549.

• Barrasa, J., Corcho, O., Gomez-perez, A., 2004. R2o, an extensible and semantically based database-to-ontology mapping language. In Proceedings of the 2nd Workshop on Semantic Web and Databases(SWDB2004). Vol. 3372. pp. 1069-1070.

• Hu, W., and Qu, Y. 2007. Discovering simple mappings between relational database schemas and ontologies. In Aberer, K.; Choi, K.-S.; Noy, N. F.; Allemang, D.; Lee, K.-I.; Nixon, L. J. B.; Golbeck, J.; Mika, P.; Maynard, D.; Mizoguchi, R.; Schreiber, G.;and Cudre-Mauroux, P., eds., ISWC/ASWC, volume 4825 of Lecture Notes in Computer Science, 225238. Springer.

• Papapanagiotou, P.; Katsiouli, P.; Tsetsos, V.; Anagnostopoulos, C.; and Hadjiefthymiades, S. 2006. Ronto: Relational to ontology schema matching. In AISSIGSEMIS BULLETIN.

Page 45: Using  linked data  to interpret  tables

• Lawrence, E. D. R. 2004. Composing mappings between schemas using a reference ontology. In In Proceedings of International Conference on Ontologies, Databases and Application of Semantics (ODBASE), 783800. Springer

• Han, L.; Finin, T.; Parr, C.; Sachs, J.; and Joshi, A. 2008. RDF123: from Spreadsheets to RDF. In Seventh International Semantic Web Conference. Springer.

• Han, L., Finin, T., Yesha, Y., 2009. Finding semantic web ontology terms from words. In: Proceedings of the Eight International Semantic Web Conference. Springer.

• Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. In: Proc. of the 36th Int'l Conference on Very Large Databases (VLDB). (2010)

References

Page 46: Using  linked data  to interpret  tables

This work was supported by: