Tables to Linked Data
Zareen Syed, Tim Finin, Varish Mulwad and Anupam Joshi
University of Maryland, Baltimore County
http://ebiquity.umbc.edu/resource/html/id/???/


Page 2: Tables to Linked Data

Age of Big Data
• Availability of massive amounts of data is driving many technical advances
• Extracting linked data from text and tables will help
• Databases & spreadsheets are obvious sources for tables but many are in documents and web pages, too
• A recent Google study found over 14B HTML tables
  – M. Cafarella, A. Halevy, D. Wang, E. Wu, Y. Zhang, Webtables: exploring the power of tables on the Web, VLDB, 2008.
• Only about 1% had high-quality relational data
• But that’s about 150M tables!

Page 3: Tables to Linked Data

Problem: given a table


Page 4: Tables to Linked Data

Generate linked data

    @prefix dbp: <http://dbpedia.org/resource/> .
    @prefix dbpo: <http://dbpedia.org/ontology/> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    @prefix cyc: <http://www.cyc.com/2004/06/04/cyc#> .

    dbp:Boston <http://dbpedia.org/ontology/PopulatedPlace/leaderName> dbp:Thomas_Menino ;
        cyc:partOf dbp:Massachusetts ;
        dbpo:populationTotal "610000"^^xsd:integer .
    dbp:New_York_City …

• Use classes, properties and instances from a linked data collection, e.g. DBpedia + Cyc + Geonames
• Confirm existing facts and discover new ones
• Create new entities as needed
• Create new relations when possible (harder)
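As a rough illustration of the target output, the sketch below serializes one interpreted table row as a Turtle statement. The function names `prefixes_to_turtle` and `row_to_turtle` are hypothetical and not part of the authors' system; the prefixes and the Boston values come from the slide. The leaderName property is written as a full IRI here because its path contains a slash.

```python
def prefixes_to_turtle(prefixes):
    """Render a {prefix: namespace IRI} mapping as @prefix declarations."""
    return "\n".join(f"@prefix {p}: <{iri}> ." for p, iri in prefixes.items())


def row_to_turtle(subject, predicate_objects):
    """Serialize one subject with its (predicate, object) pairs.

    Every term is assumed to already be a valid Turtle token
    (prefixed name, <IRI>, or typed literal); nothing is escaped here.
    """
    lines = []
    last = len(predicate_objects) - 1
    for i, (pred, obj) in enumerate(predicate_objects):
        head = subject if i == 0 else "   "   # indent continuation lines
        end = " ." if i == last else " ;"     # ';' repeats the subject
        lines.append(f"{head} {pred} {obj}{end}")
    return "\n".join(lines)


# Prefixes and values as shown on the slide.
PREFIXES = {
    "dbp": "http://dbpedia.org/resource/",
    "dbpo": "http://dbpedia.org/ontology/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "cyc": "http://www.cyc.com/2004/06/04/cyc#",
}

boston = row_to_turtle("dbp:Boston", [
    ("<http://dbpedia.org/ontology/PopulatedPlace/leaderName>", "dbp:Thomas_Menino"),
    ("cyc:partOf", "dbp:Massachusetts"),
    ("dbpo:populationTotal", '"610000"^^xsd:integer'),
])

print(prefixes_to_turtle(PREFIXES))
print()
print(boston)
```

A real system would more likely build the graph with an RDF library and let it handle escaping and serialization; plain string formatting is used here only to keep the sketch dependency-free.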

Page 5: Tables to Linked Data

What data do we want?
• Link cell values to entities, e.g. dbpo:Baltimore, dbpo:Maryland
• Find relationships between columns, e.g. dbpo:largestCity

Page 6: Tables to Linked Data

What evidence can we find?

• Column one’s type is populated place, or is it US city, or a reference to an NBA team?


Page 9: Tables to Linked Data

What do we want to extract?

• Column one’s type is populated place, or is it US city, or a reference to an NBA team?
• Column two’s type is person (or politician?) but is ‘mayor’ a type or a relation and if the latter, to what?
• Rows give important evidence too: Menino has a stronger connection to Boston than to Massachusetts
• Both cities and states have populations, …

Page 10: Tables to Linked Data

A Web of Evidence
• Table: column headers, cell values, column position, column adjacency
• Language: headers have meaning, synonyms, …
• Ontologies: capitalOf is a 1:1 relation between a GPE region and a city
• Significance: PageRank-like metrics bias linking
• Facts: the LD KB asserts Boston is in MA and that Boston’s population is close to 610K
• Graph analysis: PMI between Boston & Menino is much higher than between Massachusetts & Menino
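The graph-analysis evidence relies on pointwise mutual information (PMI), which compares how often two items co-occur with how often they would co-occur if independent. A minimal sketch follows; the counts are invented purely for illustration, and how the authors actually derive co-occurrence statistics from the knowledge base is not stated in the slides.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information: log2( p(x, y) / (p(x) * p(y)) )."""
    if count_xy == 0:
        return float("-inf")  # the pair never co-occurs
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical counts (e.g. articles mentioning each term); not real data.
TOTAL = 1_000_000
COUNT = {"Boston": 20_000, "Massachusetts": 50_000, "Menino": 500}
CO_COUNT = {("Boston", "Menino"): 400, ("Massachusetts", "Menino"): 150}

pmi_boston = pmi(CO_COUNT[("Boston", "Menino")],
                 COUNT["Boston"], COUNT["Menino"], TOTAL)
pmi_mass = pmi(CO_COUNT[("Massachusetts", "Menino")],
               COUNT["Massachusetts"], COUNT["Menino"], TOTAL)

print(f"PMI(Boston, Menino)        = {pmi_boston:.2f}")
print(f"PMI(Massachusetts, Menino) = {pmi_mass:.2f}")
```

With these made-up counts the Boston pair scores higher, which is the kind of signal that would favor reading the mayor column as related to the city rather than the state.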

Page 11: Tables to Linked Data

Approach
1. Input: table headers and rows
2. Query the knowledge base
3. Predict a class for each column
4. Re-query the knowledge base using the new evidence
5. Link each cell value to an entity using the new results
6. Identify relationships between columns
7. Output: linked data

Page 12: Tables to Linked Data

Wikitology
• A hybrid KB of structured & unstructured information extracted from Wikipedia
• Augmented with knowledge from DBpedia, Freebase, Yago and Wordnet
• The interface is via a specialized IR index
• Good for systems that need to do a combination of reasoning over text, graphs and semi-structured data

Page 13: Tables to Linked Data

Querying the Knowledge Base
• For every cell from the table, query Wikitology with: cell value + column header + row content
• Wikitology returns: top N entities, their types, PageRank (we use N = 5)
• Example query: Baltimore + City + MD + S.Dixon + 640,000
• Example results:
  1. Baltimore_Maryland
  2. Baltimore_County
  3. John_Baltimore
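The query construction for a single cell can be sketched as follows. This only assembles the query parts; it does not call Wikitology, whose interface is not described in the slides. The function names and the "State" and "Population" headers are assumptions made for the example; the row values and the "City" header come from the slides.

```python
def build_query(headers, row, col):
    """Combine a cell value, its column header and the rest of its row."""
    return {
        "cell_value": row[col],
        "column_header": headers[col],
        # Row content: every other cell in the same row, in table order.
        "row_content": [v for i, v in enumerate(row) if i != col],
    }


def query_terms(query):
    """Flatten the query parts into one ordered list of terms."""
    return [query["cell_value"], query["column_header"], *query["row_content"]]


# Example row; headers other than "City" are assumed for illustration.
headers = ["City", "State", "Mayor", "Population"]
row = ["Baltimore", "MD", "S.Dixon", "640,000"]

query = build_query(headers, row, col=0)
print(" + ".join(query_terms(query)))  # Baltimore + City + MD + S.Dixon + 640,000
```

Keeping the parts separate in a dictionary, rather than concatenating them immediately, would let a retrieval backend weight the cell value, header and row context differently.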

Page 14: Tables to Linked Data

Predicting Classes for Columns
• Set of classes per column, e.g. dbpedia-owl:Place, dbpedia-owl:Area, yago:AmericanConductors, yago:LivingPeople, dbpedia-owl:PopulatedPlace, dbpedia-owl:Band, dbpedia-owl:Organisation, …
• Score the classes
  – Score = w × (1 / R) + (1 - w) × PageRank, where R is the entity’s rank
  – E.g. [Baltimore, dbpedia:Area] = 0.89
  – Select the class that maximizes its sum of scores over the entire column
  – E.g. [Baltimore, dbpedia:Area] + [Boston, dbpedia:Area] + [New York, dbpedia:Area] = 2.85
• Choose the top class from each of the four vocabularies: DBpedia, Freebase, Wordnet and Yago
  – Column: City
  – Dbpedia:PopulatedPlace, Wordnet:City, Freebase:Location, Yago:CitiesinUnitedStates
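The scoring rule can be sketched in a few lines. Several details here are assumptions not given in the slides: the weight w = 0.5, PageRank values normalized to [0, 1], and taking the best-scoring entity when several candidates of one cell share a class. The candidate data is made up, and the sketch covers a single vocabulary rather than all four.

```python
from collections import defaultdict


def class_score(rank, page_rank, w=0.5):
    """Score = w * (1 / R) + (1 - w) * PageRank, with R the entity's rank."""
    return w * (1.0 / rank) + (1.0 - w) * page_rank


def predict_column_class(column, w=0.5):
    """Sum per-cell class scores over a non-empty column and return the best class.

    `column` holds one candidate list per cell; each candidate is
    (rank, page_rank, classes) for one of the top-N entities.
    """
    totals = defaultdict(float)
    for candidates in column:
        best_for_cell = {}
        for rank, page_rank, classes in candidates:
            score = class_score(rank, page_rank, w)
            for cls in classes:
                # Assumption: a class counts once per cell, at its best score.
                best_for_cell[cls] = max(best_for_cell.get(cls, 0.0), score)
        for cls, score in best_for_cell.items():
            totals[cls] += score
    best = max(totals, key=totals.get)
    return best, dict(totals)


# Hypothetical candidates for two cells of a "City" column.
column = [
    [(1, 0.8, {"dbpedia-owl:PopulatedPlace"}),   # Baltimore
     (2, 0.4, {"dbpedia-owl:Place"}),
     (3, 0.1, {"yago:LivingPeople"})],
    [(1, 0.9, {"dbpedia-owl:PopulatedPlace"}),   # Boston
     (2, 0.5, {"dbpedia-owl:Band"})],
]

best, totals = predict_column_class(column)
print(best, round(totals[best], 2))
```

Running the same function once per vocabulary, on candidates filtered to that vocabulary's classes, would give one top class for each of the four vocabularies.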

Page 15: Tables to Linked Data

Linking table cells to entities
• Once the classes are predicted, we re-query the knowledge base with this new evidence
• Along with the original query, we also include the predicted types
• We pick the highest-ranking entity which matches the predicted type from the new results
• For every cell from the table, query the KB with: cell value + column header + row content + predicted column type
• The KB returns: top N entities, their types (we use N = 5)
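The selection step, picking the highest-ranking entity that matches the predicted type, can be sketched as below. The function name `link_cell` is hypothetical; the entity names appear in the slides, but their ordering and the types attached to them here are invented for illustration.

```python
def link_cell(candidates, predicted_types):
    """Return the first (highest-ranked) entity matching a predicted type.

    `candidates` is a list of (entity, types) pairs ordered best first.
    Returns None when nothing matches, so the caller can decide what to do.
    """
    for entity, types in candidates:
        if predicted_types & set(types):
            return entity
    return None


# Hypothetical re-query results for the cell "Baltimore".
candidates = [
    ("Baltimore_County", {"dbpedia-owl:AdministrativeRegion"}),
    ("Baltimore_Maryland", {"dbpedia-owl:PopulatedPlace"}),
    ("John_Baltimore", {"yago:LivingPeople"}),
]

linked = link_cell(candidates, {"dbpedia-owl:PopulatedPlace"})
print(linked)  # Baltimore_Maryland
```

Returning `None` rather than falling back to the top result leaves room for the "create new entities as needed" case mentioned earlier in the deck.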

Page 16: Tables to Linked Data

Preliminary results: entity linking

• In a preliminary evaluation, we used 5 Google Squared tables comprising 23 columns and 39 rows, comparing our results with human judgments
• The next evaluation will be on selected tables from the Google collection of >2500 involving 6 domains: bibliography, car, course, country, movie, people

Task                                      Accuracy
Class prediction for columns: DBpedia     85.7%
Class prediction for columns: Freebase    90.5%
Class prediction for columns: Wordnet     71.4%
Class prediction for columns: Yago        71.4%
Entity linking                            76.6%

Page 17: Tables to Linked Data

Ongoing and Future Work
• Identifying relationships between columns
• Modules for common ‘special cases’, e.g. numbers, acronyms, phone numbers, stock symbols, email addresses, URLs, etc.
• Replace heuristics by machine learning techniques for combining evidence and clustering

Page 18: Tables to Linked Data

Conclusion
• There’s lots of data stored in tables: in spreadsheets, databases, Web pages and documents
• In some cases we can interpret them and generate a linked data representation
• In others we can at least link some cell values to LOD entities
• This can help contribute data to the Web in a form that is easy for machines to understand and use