Automatically Generating Government Linked Data from Tables

Varish Mulwad (@varish)
University of Maryland, Baltimore County

November 5, 2011

Dr. Tim Finin
Dr. Anupam Joshi



What?


State | StateFIPS | County | CountyFIPS | Group | Label | Value
Alabama | 1 | Macon | 87 | Farms with Black or African American operators | Value of sales of grains, oilseeds, dry beans, and dry peas (farms) | 5
Arizona | …. | Navajo | …. | …. | …. | ….
Arkansas | 5 | Union | 139 | Farms with women principal operators | Total value of agricultural products sold (farms) | 56
California | 6 | Humboldt | 23 | … | …. | 19

Annotations shown on the table:
http://dbpedia.org/class/AdministrativeRegion
http://dbpedia.org/resource/Arizona
Map literals as values of properties
dbpedia-owl:state

Introduction Related Work Baseline Results Joint Inference Conclusion


@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix dbpprop: <http://dbpedia.org/property/> .
@prefix dgtwc: <http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#> .

"State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion .

[ a dgtwc:DataEntry ;
  dbpedia-owl:state dbpedia:Alabama ;
  dbpedia:FIPS county code 000 ;
  dbpedia:Federal Information Processing Standard state code 001 ;
  dbpedia-owl:ethnicGroup "Farm with women principal operators"@en ;
  dbpedia-owl:number 6444 ] .

All this in a completely automated way!

Contribution


Why?


Tables are everywhere! … yet …

The web – 154 million high-quality relational tables [1]


Evidence-based medicine

Figure: Evidence-Based Medicine - the Essential Role of Systematic Reviews, and the Need for Automated Text Mining Tools, IHI 2010

The idea behind Evidence-based Medicine is to judge the efficacy of treatments or tests by meta-analyses or reviews of clinical trials. Key information in such trials is encoded in tables.

However, the rate at which meta-analyses are published remains very low … hampers effective health care treatment …

[Chart: # of clinical trials published in 2008 vs. # of meta-analyses published in 2008]


> 400,000 raw and geospatial datasets
~ < 1% in RDF


Current Systems

– Require users to have knowledge of the Semantic Web

– Do not automatically link to existing classes and entities on the Semantic Web / Linked Data cloud

– RDF data in some cases is as useless as raw data

– Majority of the work focused on relational data where schema is available

– Web tables systems use ‘semantically poor knowledge bases’


Dataset 1425

<rdf:Description rdf:about="#entry1">
  <value>6444</value>
  <label>Number of Farms</label>
  <group>Farms with women principal operators</group>
  <county fips>000</county fips>
  <state fips>01</state fips>
  <state>Alabama</state>
  <rdf:type rdf:resource="http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry"/>
</rdf:Description>


How?


Building a table interpretation framework

• Preliminary work / Baseline system

• Analysis and Evaluation of baseline

• “Domain Independent” Framework grounded in graphical models and probabilistic reasoning


The System’s Brain (Knowledgebase)

Yago

Wikitology¹ – A hybrid knowledgebase where structured data meets unstructured data

¹ Wikitology was created as part of Zareen Syed’s Ph.D. dissertation

Syed, Z., and Finin, T. 2011. Creating and Exploiting a Hybrid Knowledge Base for Linked Data, volume 129 of Revised Selected Papers Series: Communications in Computer and Information Science. Springer.


The Baseline System


T2LD Framework

Predict Class for Columns → Linking the table cells → Identify and Discover relations


Predicting Class Labels for column

Column header (Class): State
Cell values (Instance): Alabama, Arizona, Arkansas, California

Candidate entities for a cell:
1. Alabama
2. Alabama_(band)
3. Alabama_(people)

Candidate classes per cell:
{dbpedia-owl:Place, dbpedia-owl:AdministrativeRegion, yago:StatesOfTheUnitedStates, dbpedia-owl:Band, yago:NativeAmericanTribes …}
{dbpedia-owl:Place, yago:StatesOfTheUnitedStates, dbpedia-owl:Film, …. ….. ….. }
{……………………………………………………………. }

Combined candidate classes for the column:
dbpedia-owl:Place, dbpedia-owl:AdministrativeRegion, yago:StatesOfTheUnitedStates, dbpedia-owl:Band, yago:NativeAmericanTribes, dbpedia-owl:Film ...
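
The pooling of candidate classes could be sketched as a simple vote, where each cell contributes the types of its candidate entities and classes are ordered by how many cells support them. This is only an illustration under assumptions: the slides do not state the actual ranking function, `CANDIDATE_TYPES` is hypothetical data standing in for knowledge base query results, and the types listed for Arkansas and California are made up for the example.

```python
from collections import Counter

# Hypothetical lookup: cell value -> types of its candidate entities.
# Only the first two sets follow the slide; the rest are illustrative.
CANDIDATE_TYPES = {
    "Alabama": ["dbpedia-owl:Place", "dbpedia-owl:AdministrativeRegion",
                "yago:StatesOfTheUnitedStates", "dbpedia-owl:Band",
                "yago:NativeAmericanTribes"],
    "Arizona": ["dbpedia-owl:Place", "yago:StatesOfTheUnitedStates",
                "dbpedia-owl:Film"],
    "Arkansas": ["dbpedia-owl:Place", "dbpedia-owl:AdministrativeRegion",
                 "yago:StatesOfTheUnitedStates"],
    "California": ["dbpedia-owl:Place", "dbpedia-owl:AdministrativeRegion",
                   "yago:StatesOfTheUnitedStates", "dbpedia-owl:Film"],
}


def rank_column_classes(cells, candidate_types):
    """Rank classes by the number of cells whose candidates carry them."""
    votes = Counter()
    for cell in cells:
        # Each class counts at most once per cell.
        votes.update(set(candidate_types.get(cell, [])))
    # Highest vote count first; ties broken by name for a stable order.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


ranking = rank_column_classes(
    ["Alabama", "Arizona", "Arkansas", "California"], CANDIDATE_TYPES
)
for class_name, count in ranking:
    print(count, class_name)
```

A plain vote like this tends to place very general classes such as dbpedia-owl:Place at the top, since almost every candidate carries them.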


Linking table cells to entities

Query: Macon + County + Alabama + 1 + 87 + Farms with Black or African American operators + ... + dbpedia-owl:AdministrativeRegion

Candidate entities:
1. Macon County, Alabama
2. Macon County, Illinois

Classifier 1 – SVM Rank (ranks the set of entities)

Classifier 2 – SVM (computes confidence)

Outcome: link to the top ranked entity, or don’t link
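
The rank-then-decide flow on this slide might look roughly like the sketch below. It is not the system's implementation: both SVM classifiers are replaced by a toy token-overlap score so the example stays self-contained, and the `threshold` value, `token_overlap`, and `link_cell` are hypothetical names introduced for illustration.

```python
def token_overlap(query_tokens, candidate_label):
    """Stand-in scorer: fraction of the candidate's tokens found in the query."""
    label_tokens = candidate_label.lower().replace(",", " ").split()
    if not label_tokens:
        return 0.0
    hits = sum(1 for token in label_tokens if token in query_tokens)
    return hits / len(label_tokens)


def link_cell(query, candidates, threshold=0.8):
    """Return the top-ranked candidate, or None when confidence is too low."""
    query_tokens = set(query.lower().replace(",", " ").split())
    # Stage 1 (SVM Rank in the slides): order the candidate entities.
    ranked = sorted(
        candidates,
        key=lambda label: token_overlap(query_tokens, label),
        reverse=True,
    )
    if not ranked:
        return None
    top = ranked[0]
    # Stage 2 (SVM in the slides): decide whether the top candidate is
    # trustworthy enough to link; here the same toy score is reused.
    confidence = token_overlap(query_tokens, top)
    return top if confidence >= threshold else None


query = "Macon County Alabama 1 87 Farms with Black or African American operators"
candidates = ["Macon County, Illinois", "Macon County, Alabama"]
linked = link_cell(query, candidates)
print(linked)
```

Keeping the ranking step separate from the link/don't-link decision lets the second stage refuse to link when even the best candidate is a poor match.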


Identify Relations

State | County
Alabama | Macon
Arizona | Navajo
Arkansas | Union
California | Humboldt

Candidate relations between the paired cells:
Rel ‘A’
Rel ‘A’
Rel ‘A’, ‘C’
Rel ‘A’, ‘B’, ‘C’
Rel ‘A’, ‘B’
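
One plausible way to turn the per-pair candidate relations into a single relation for the column pair is a majority vote, sketched below. The slide does not state the selection rule, so the vote and the alphabetical tie-break are assumptions; `pick_relation` is a hypothetical helper name.

```python
from collections import Counter


def pick_relation(candidate_relations_per_pair):
    """Pick the relation supported by the most cell pairs (ties: by name)."""
    votes = Counter()
    for relations in candidate_relations_per_pair:
        # A relation counts once per cell pair.
        votes.update(set(relations))
    if not votes:
        return None
    return min(votes.items(), key=lambda kv: (-kv[1], kv[0]))[0]


# Candidate relation sets as listed on the slide.
pairs = [["A"], ["A"], ["A", "C"], ["A", "B", "C"], ["A", "B"]]
best = pick_relation(pairs)
print(best)
```

With the slide's candidate sets, relation 'A' is the only one supported by every pair, so it wins the vote.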


Generating a linked RDF representation

@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix dbpprop: <http://dbpedia.org/property/> .
@prefix dgtwc: <http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#> .

"State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion .

[ a dgtwc:DataEntry ;
  dbpedia-owl:state dbpedia:Alabama ;
  dbpedia:FIPS county code 000 ;
  dbpedia:Federal Information Processing Standard state code 001 ;
  dbpedia-owl:ethnicGroup "Farm with women principal operators"@en ;
  dbpedia-owl:number 6444 ] .
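
As a rough illustration of the final step, the sketch below serializes one annotated row as a Turtle blank node using plain string formatting. It assumes the prefixes are declared elsewhere, reuses only predicates that appear on the slide, and leaves out the two FIPS properties because the slide does not give usable property names for them; `to_turtle` and its `(predicate, value, kind)` tuples are hypothetical.

```python
def to_turtle(entry_class, properties):
    """Serialize one row as a Turtle blank node.

    properties: list of (predicate, value, kind) tuples, where kind is
    "iri" (prefixed name), "number", or "literal" (English string).
    """
    lines = [f"    a {entry_class}"]
    for predicate, value, kind in properties:
        if kind == "iri":
            obj = value
        elif kind == "number":
            obj = str(value)
        else:
            # Escape backslashes and double quotes inside the literal.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"@en'
        lines.append(f"    {predicate} {obj}")
    body = " ;\n".join(lines)
    return f"[\n{body}\n] ."


turtle = to_turtle(
    "dgtwc:DataEntry",
    [
        ("dbpedia-owl:state", "dbpedia:Alabama", "iri"),
        ("dbpedia-owl:ethnicGroup", "Farm with women principal operators", "literal"),
        ("dbpedia-owl:number", 6444, "number"),
    ],
)
print(turtle)
```

In practice an RDF library would be a safer choice than string formatting, since it handles escaping and prefix management.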


Evaluation of the baseline system


Dataset summary

Number of tables | 15
Total number of rows | 199
Total number of columns | 56 (52)
Total number of entities | 639 (611)

* The number in brackets indicates the count excluding columns that contained numbers


Evaluation # 1 (MAP)

• Compared the system’s ranked list of labels against a human-ranked list of labels

• Metric - Average Precision (a.p.) [Mean Average Precision gives a mean over set of queries]

• Commonly used in the Information Retrieval domain to compare two ranked sets


Evaluation # 1 (MAP)

[Figure: Average Precision per column; x-axis: Column # (0 to 60), y-axis: Average Precision (0 to 1.2)]

MAP = 0.411

System Ranked:
1. Person
2. Politician
3. President

Evaluator Ranked:
1. President
2. Politician
3. OfficeHolder
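
For reference, average precision for one column can be computed as in the sketch below, which follows the standard Information Retrieval definition. Two assumptions are worth flagging: the evaluator's labels are treated as an unordered set of relevant labels, and the sum is normalized by the number of relevant labels. The slides do not say exactly how the reported MAP of 0.411 was computed, so this is not claimed to reproduce it.

```python
def average_precision(system_ranked, relevant_labels):
    """Average of precision@k over the ranks k where a relevant label appears."""
    relevant = set(relevant_labels)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(system_ranked, start=1):
        if label in relevant:
            hits += 1
            precision_sum += hits / rank
    # Normalize by the number of relevant labels (an assumption, see above).
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (system_ranked, relevant_labels) pairs."""
    return sum(average_precision(s, r) for s, r in runs) / len(runs)


system = ["Person", "Politician", "President"]
evaluator = ["President", "Politician", "OfficeHolder"]
ap = average_precision(system, evaluator)
print(round(ap, 3))
```

For the example lists, the system earns credit at ranks 2 and 3 (precision 1/2 and 2/3) and nothing for Person, which the evaluator did not list.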


Accuracy for Entity Linking

[Figure: % of correct and incorrect instances linked, by category]

Category | Correct | Incorrect
Person | 83.05% | 16.95%
Place | 80.43% | 19.57%
Organization | 61.90% | 38.10%
Other | 29.22% | 70.78%

Overall Accuracy: 66.12%


Lessons Learnt

• Sequential System – Error percolated from one phase to the next

• Current system favors general classes over specific ones (MAP score = 0.411)

• Largely, a system driven by “heuristics”

• Although we consider evidence, we don’t do assignment jointly

T2LD Framework: Predict Class for Columns → Linking the table cells → Identify and Discover relations


A “Domain Independent” Framework

[Diagram labels: table cells (a,b,c,… ; m,n,o,… ; x,y,z,…); Probabilistic Graphical Model / Joint Inference Model; KB; Domain Knowledge – Linked Data Cloud / Medical Domain / Open Govt.; Domain; Query; Linked Data]


Joint Inference over evidence in a table

Probabilistic Graphical Models


Parameterized graphical model

Variable nodes:
Column headers: C1, C2, C3
Row values: R11, R12, R13, R21, R22, R23, R31, R32, R33

Factor nodes:
ψ3 – function that captures the affinity between the column headers and row values
ψ4 – captures interaction between row values
ψ5 – captures interaction between column headers
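
To make the joint assignment idea concrete, the sketch below scores every combination of column classes and cell entities as a product of the three factor types and keeps the best one. It is a toy under stated assumptions: the factor values, candidate sets, and lookup tables are invented for illustration, only adjacent columns and cells are paired, and exhaustive search is used purely because the example is tiny. The slides do not specify the inference algorithm.

```python
from itertools import product

# Hypothetical knowledge used by the toy factors.
ENTITY_TYPE = {
    "Alabama": "State",
    "Alabama_(band)": "Band",
    "Macon_County,_Alabama": "County",
    "Macon,_Georgia": "City",
}
RELATED_ENTITIES = {("Alabama", "Macon_County,_Alabama")}
COMPATIBLE_CLASSES = {("State", "County"), ("State", "City")}


def psi3(column_class, entity):
    """Affinity between a column header's class and a row value."""
    return 1.0 if ENTITY_TYPE.get(entity) == column_class else 0.1


def psi4(left_entity, right_entity):
    """Interaction between row values in the same row."""
    return 1.0 if (left_entity, right_entity) in RELATED_ENTITIES else 0.2


def psi5(left_class, right_class):
    """Interaction between column headers."""
    return 1.0 if (left_class, right_class) in COMPATIBLE_CLASSES else 0.3


def joint_score(column_classes, rows):
    """Product of all factor values for one full assignment."""
    score = 1.0
    for row in rows:
        for column_class, entity in zip(column_classes, row):
            score *= psi3(column_class, entity)
        for left, right in zip(row, row[1:]):
            score *= psi4(left, right)
    for left, right in zip(column_classes, column_classes[1:]):
        score *= psi5(left, right)
    return score


def best_assignment(column_candidates, cell_candidates):
    """Exhaustively search all assignments; feasible only for tiny tables."""
    best, best_score = None, -1.0
    for column_classes in product(*column_candidates):
        for rows in product(*(product(*row) for row in cell_candidates)):
            score = joint_score(column_classes, rows)
            if score > best_score:
                best, best_score = (column_classes, rows), score
    return best, best_score


column_candidates = [["State", "Band"], ["County", "City"]]
cell_candidates = [
    [["Alabama", "Alabama_(band)"], ["Macon_County,_Alabama", "Macon,_Georgia"]],
]
assignment, score = best_assignment(column_candidates, cell_candidates)
print(assignment, score)
```

Because all variables are scored together, evidence from the neighbouring cell and the other column header can pull "Alabama" toward the state rather than the band, which a sequential pipeline cannot do.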


Challenges


Challenges - Literals

Population | Age
690,000 | 75
345,000 | 65
510,020 | 50
120,000 | 25

Population / Profit?

Age / Percentage?

Use evidence from the rest of the table to decide


Challenges - Metadata


More Challenges!

• Sampling and Interpretation
– Data set 1425 has > 400,000 rows!

• Human in the Loop


Conclusion

• Presented a framework for inferring the semantics of tables and generating Linked Data

• Evaluation of the baseline system shows the feasibility of tackling the problem

• Work in progress on building a framework grounded in graphical models and probabilistic reasoning

• Working on tackling challenges posed by tables from domains such as medical and open government data


References

1. Cafarella, M. J.; Halevy, A. Y.; Wang, Z. D.; Wu, E.; and Zhang, Y. 2008. WebTables: exploring the power of tables on the web. PVLDB 1(1):538–549.

2. Hurst, M. 2006. Towards a theory of tables. IJDAR 8(2-3):123–131.

3. Embley, D. W.; Lopresti, D. P.; and Nagy, G. 2006. Notes on contemporary table recognition. In Document Analysis Systems, 164–175.

4. Wang, J.; Shao, B.; Wang, H.; and Zhu, K. Q. 2010. Understanding tables on the web. Technical report, Microsoft Research Asia.

5. Venetis, P.; Halevy, A.; Madhavan, J.; Pasca, M.; Shen, W.; Wu, F.; Miao, G.; and Wu, C. 2011. Recovering semantics of tables on the web. In Proc. of the 37th Int'l Conference on Very Large Databases (VLDB).

6. Limaye, G.; Sarawagi, S.; and Chakrabarti, S. 2010. Annotating and searching web tables using entities, types and relationships. In Proc. of the 36th Int'l Conference on Very Large Databases (VLDB).


Thank You! Questions?

[email protected]

@varish
http://ebiq.org/h/Varish/Mulwad

[email protected] [email protected]