Machines learnt how to understand tables. What happens next will shock you. Welcome to the PhD dissertation defense of Varish Mulwad!

Page 1

Machines learnt how to understand tables. What happens next will shock you.

Welcome to the PhD dissertation defense of Varish Mulwad!

Page 2

TABEL – Domain Independent and Extensible Framework to Infer the Semantics of Tables

Varish Mulwad

Ph.D. Dissertation Defense

Adviser: Dr. Tim Finin

January 8, 2015

Page 3

Page 4

Zareen Syed, Tim Finin, Varish Mulwad, and Anupam Joshi, "Exploiting a Web of Semantic Data for Interpreting Tables", In 2nd Web Science Conference (WebSci 2010), Raleigh, NC, USA, Apr. 2010

Semantics of a Table

Name | Team | Position | Height
Michael Jordan | Chicago | Shooting Guard | 1.98
Allen Iverson | Philadelphia | Point Guard | 1.83
Yao Ming | Houston | Center | 2.29
Tim Duncan | San Antonio | Power Forward | 2.11

NationalBasketballAssociationTeams

http://dbpedia.org/resource/Allen_Iverson

Map literals as property values

playsFor

Page 5

Semantics of a Table

Name | Team | Position | Height
Michael Jordan | Chicago | Shooting Guard | 1.98
Allen Iverson | Philadelphia | Point Guard | 1.83
Yao Ming | Houston | Center | 2.29
Tim Duncan | San Antonio | Power Forward | 2.11

Linked Data

tab:cell_01 a tab:ColumnHeader;
    tab:cellLabel "Name"^^xsd:string;
    tab:columnIndex "1"^^xsd:integer;
    tab:valueType dbpedia-owl:BasketballPlayer.

tab:cell_11 a tab:DataCell;
    tab:cellLabel "Michael Jordan"^^xsd:string;
    tab:columnIndex "1"^^xsd:integer;
    tab:rowIndex "1"^^xsd:integer;
    tab:entity dbpedia:Michael_Jordan.

All this in a completely automated way!

Page 6

TABEL – Domain Independent & Extensible Framework to Infer the Semantics of Tables

Page 7

Thesis Statement

It is possible to generate high quality linked data from tables by jointly inferring the semantics of column headers, values (string and literal) in table cells, and relations between columns augmented with background knowledge from open data sources such as the Linked Open Data cloud.

Page 8

Contributions

o Probabilistic graphical model to jointly infer the semantics, plus a novel inference technique: Semantic Message Passing

o A proof-of-concept, user-interactive application to generate meta-analysis reports automatically

o Develop & Explore Human in the Loop paradigm

o A novel technique to generate candidate properties from literal values

Page 9

Why

How

Evaluation

Application

Wrap up

Page 10

Tables are everywhere!

154 million high quality relational tables on the web

~400,000 CSVs on data.gov

Healthcare, Financial and other domains

Page 11

The Semantic Web & the Web

Spreadsheets/CSVs to RDF/OWL

Page 12

Evidence Based Medicine

Combine: All studies that compare organic milk v/s grass-fed cow milk

Produce Unified report: Organic Milk is better!

Meta – Analysis report

Correlation between cardiovascular risk factors and venous thrombosis

Duration of proton pump inhibitors as first line of treatment for Helicobacter pylori eradication

Page 13

Tables are valuable

Page 14

Meta – Analysis: Today

Correlation between cardiovascular risk factors and venous thrombosis [1]

Initial search >> 1949 studies

Final # of studies selected >> 22!

[1] W. Ageno, C. Becattini, T. Brighton, R. Selby, and P. W. Kamphuisen, “Cardiovascular risk factors and venous thromboembolism: a meta-analysis,” Circulation, vol. 117, no. 1, pp. 93–102, 2008.

• Keyword based search

• Initial search yields large # of results

• Manually filter out irrelevant results

Page 15

Not restricted to healthcare …

Page 16

Related Work

Databases & Spreadsheets to RDF:

Existing solutions: Largely manual or semi-automatic

Number of ontologies, classes, relations

Automatic solutions: “Row as RDF node”; local mappings

No links to existing classes, properties, entities

Page 17

Related Work

Semantics of Table:

Infer semantics for only parts of the table [header cells; relation between headers; data cell values or a combination of the two]

Fail to generate RDF Linked Data representation

Poor support for literals

Page 18

Related Work

Limaye et al. [Sep. 2010] [Soumen Chakrabarti’s group @ IIT-B]

RDF Linked Data representation

Literal values

Knoblock et al. [May 2012] [Craig Knoblock’s group @ USC – ISI]

Largely focuses on header cell semantics & relation between headers

Requires initial user input before automatic predictions from the system

Venetis et al. [Sep. 2011] [Alon Halevy’s group @ Google]

Column header and relation semantics

Literal values; RDF Linked Data

Page 19

What TABEL brings to the “table”

Infers the complete semantics of a table

Generates an RDF Linked Data representation

Supports tables with different structures over a variety of domains [medical tables]

Incorporates user feedback to improve the quality of inferred semantics

Infers the semantics of literal values* [numerical values]

Page 20

Why

How

Evaluation

Application

Wrap up

Page 21

TABEL – TABle Extracted as Linked Data

DECODE AAD

Pre-processing modules [Your module here!] > Query and Rank > Joint Inference > Generate RDF Linked Data > Verify (optional) > Store / Publish

Name | Team | Position | Height
Michael Jordan | Chicago | Shooting Guard | 1.98
Allen Iverson | Philadelphia | Point Guard | 1.83
Yao Ming | Houston | Center | 2.29
Tim Duncan | San Antonio | Power Forward | 2.11

Varish Mulwad, Tim Finin and Anupam Joshi, “A Domain Independent Framework for Extracting Linked Semantic Data from Tables”, In Search Computing, ISBN 978-3-642-34212-7, vol. 7538, 2012.

Page 22

Query – Candidate Entities

Chicago + Context {Team} + Context {Michael Jordan, Shooting Guard, 1.98}

Query results: 1. Chicago 2. Judy_Chicago 3. Chicago_Bulls

Re-rank – Classifier (String Similarity, Popularity)

Re-ranked: 1. Chicago_Bulls 2. Chicago 3. Judy_Chicago

Varish Mulwad, Tim Finin, Zareen Syed and Anupam Joshi, “Using linked data to interpret tables”, In 1st Int. Workshop on Consuming Linked Data, held at the 9th Int. Semantic Web Conf. (ISWC 2010), Shanghai, China, Nov. 2010.
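The re-ranking step on this slide can be sketched in a few lines. This is only an illustration: the slide names a classifier over string similarity and popularity, while the sketch swaps in a plain weighted sum. The function names, the weights, and the popularity numbers are all made up for the example.

```python
from difflib import SequenceMatcher

def string_similarity(cell_text, entity):
    """Similarity in [0, 1] between the cell string and an entity label."""
    label = entity.replace("_", " ").lower()
    return SequenceMatcher(None, cell_text.lower(), label).ratio()

def rerank(cell_text, candidates, popularity, w_sim=0.5, w_pop=0.5):
    """Order candidates by a weighted sum of string similarity and popularity.

    The weighted sum is a stand-in for the trained classifier (assumption).
    """
    def score(entity):
        sim = string_similarity(cell_text, entity)
        return w_sim * sim + w_pop * popularity.get(entity, 0.0)
    return sorted(candidates, key=score, reverse=True)

candidates = ["Chicago", "Judy_Chicago", "Chicago_Bulls"]
# Hypothetical popularity scores once the column context {Team} is considered.
popularity = {"Chicago": 0.3, "Judy_Chicago": 0.1, "Chicago_Bulls": 0.9}
ranked = rerank("Chicago", candidates, popularity)
print(ranked)
```

With these made-up numbers the team entity moves to the top even though the plain city label is the closest string match.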

Page 23

Query – Candidate Classes

Team [Class]: Chicago, Philadelphia, Houston, San Antonio [Instance]

Candidate entities: 1. Chicago_Bulls 2. Chicago 3. Judy_Chicago

{Place, City, WomenArtist, LivingPeople, NationalBasketballAssociationTeams}

{Place, PopulatedPlace, Film, NationalBasketballAssociationTeams, … , …}

{…}

Place, City, WomenArtist, LivingPeople, NationalBasketballAssociationTeams, PopulatedPlace, Film, …
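A rough sketch of how the candidate class set for a column could be assembled: take every candidate entity of every cell and pool their types. The per-entity type lists below are assumptions for illustration; the slide only shows the pooled sets.

```python
def candidate_classes(column_cells, entity_candidates, entity_types):
    """Pool the types of all candidate entities of all cells, in first-seen order."""
    pooled = []
    for cell in column_cells:
        for entity in entity_candidates.get(cell, []):
            for cls in entity_types.get(entity, []):
                if cls not in pooled:
                    pooled.append(cls)
    return pooled

entity_candidates = {"Chicago": ["Chicago_Bulls", "Chicago", "Judy_Chicago"]}
# Hypothetical type assignments for each candidate entity.
entity_types = {
    "Chicago_Bulls": ["NationalBasketballAssociationTeams"],
    "Chicago": ["Place", "City"],
    "Judy_Chicago": ["WomenArtist", "LivingPeople"],
}
classes = candidate_classes(["Chicago"], entity_candidates, entity_types)
print(classes)
```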

Page 24

Query – Candidate Relations

Name: Michael Jordan, Allen Iverson, Yao Ming, Tim Duncan

Team: Chicago, Philadelphia, Houston, San Antonio

Candidate entities for Michael Jordan: 1. Michael_Jordan 2. Michael_I_Jordan 3. Jordan_River

Candidate entities for Chicago: 1. Chicago_Bulls 2. Chicago 3. Judy_Chicago

Relations between candidate entity pairs: playsFor, livesIn, ….

Candidate relations: playsFor, livesIn, born, …….
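One way to picture the candidate relation step: for each row, pair every candidate entity of the first cell with every candidate entity of the second cell and collect the relations that link them. The `known_relations` lookup is a hypothetical stand-in for a knowledge base query.

```python
from itertools import product

def candidate_relations(row_pairs, entity_candidates, known_relations):
    """Collect relations holding between any candidate-entity pair of any row."""
    found = []
    for left_cell, right_cell in row_pairs:
        lefts = entity_candidates.get(left_cell, [])
        rights = entity_candidates.get(right_cell, [])
        for subj, obj in product(lefts, rights):
            for rel in known_relations.get((subj, obj), []):
                if rel not in found:
                    found.append(rel)
    return found

entity_candidates = {
    "Michael Jordan": ["Michael_Jordan", "Michael_I_Jordan", "Jordan_River"],
    "Chicago": ["Chicago_Bulls", "Chicago", "Judy_Chicago"],
}
# Hypothetical knowledge base facts, keyed by (subject, object).
known_relations = {
    ("Michael_Jordan", "Chicago_Bulls"): ["playsFor"],
    ("Michael_Jordan", "Chicago"): ["livesIn"],
}
relations = candidate_relations(
    [("Michael Jordan", "Chicago")], entity_candidates, known_relations
)
print(relations)
```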

Page 25

Query – Literals* [numeral data]

Team: Chicago, Philadelphia, Houston, San Antonio

Place, City, WomenArtist, LivingPeople, NationalBasketballAssociationTeams, PopulatedPlace, Film

Chicago

Page 26

Query – Literals

1.98, 1.83, 2.29, 2.11

?

Page 27

NumKB

320,900; 183,120; 229,198; 211,123: Population, Income

1.98; 1.83; 2.29; 2.11: Height

Person, BasketBallPlayer (?)

NumKB: Encodes distributional features for Linked Data properties

Allows query using literal values (and optionally property name)

Provides information on property domains

250,000

1.95

Page 28

Identify property domains

Property: seatingCapacity

Get Instances > Get Instance Types > Order by frequency

Instances: Queen's_Film_Theatre, Restaurant_Gordon_Ramsay, M&T_Bank_Stadium

Instance types: Theatre, Stadium, Restaurant

1. seatingCapacity_Stadium [1]
2. seatingCapacity_Theatre [0.70]
3. seatingCapacity_Restaurant [0.57]

Duplet score: 1/

Page 29

Identify property domain duplet values

Property, domain: [seatingCapacity, Stadium]

Get Property Values > Sort; Trim front & back tails; Compute µ & σ > Compute Ranges

1777720767500

Ranges:
µ - σ : µ + σ
µ - 2σ : µ + 2σ

-212 : 25743 [86.66 %]
-13190 : 38721 [6.56 %]
-26168 : 51699 [4.67 %]
-39146 : 64677 [2.08 %]
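The range computation can be sketched as follows. The trim fraction, the use of the population standard deviation, and the function name are assumptions; the slide only states that the tails are trimmed and that µ and σ are computed.

```python
from statistics import mean, pstdev

def sigma_ranges(values, trim_fraction=0.05, max_k=4):
    """Return [(low, high)] for mu ± k*sigma, k = 1..max_k, after trimming both tails."""
    ordered = sorted(values)
    cut = int(len(ordered) * trim_fraction)   # values dropped from each tail
    trimmed = ordered[cut:len(ordered) - cut]
    mu = mean(trimmed)
    sigma = pstdev(trimmed)                   # population std dev (assumption)
    return [(mu - k * sigma, mu + k * sigma) for k in range(1, max_k + 1)]

ranges = sigma_ranges([10, 20, 30, 40, 50], trim_fraction=0.0, max_k=2)
for low, high in ranges:
    print(f"{low:.2f} : {high:.2f}")
```

The four ranges listed on the slide are consistent with this scheme for k = 1 to 4: taking -212 : 25743 as µ ± σ gives µ ≈ 12,766 and σ ≈ 12,978, and µ ± 2σ then comes out at about -13,190 : 38,721.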

Page 30

Query – Literals

Query: 1.98, height

NumKB: 1. height 2. diameter 3. minimumElevation

minRange < 1.98 < maxRange

Fuzzy string match (ColHeaderString, PropertyName)
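A small sketch of the two checks on this slide: keep the properties whose range contains the literal, then order them by a fuzzy match between the column header string and the property name. The `NUMKB` dictionary and its ranges are invented for illustration, and `difflib` stands in for whatever string matcher the system actually uses.

```python
from difflib import SequenceMatcher

# Hypothetical NumKB entries: property -> (minRange, maxRange).
NUMKB = {
    "height": (1.5, 2.3),
    "diameter": (0.5, 12000.0),
    "minimumElevation": (-400.0, 4000.0),
    "populationTotal": (1000.0, 9000000.0),
}

def query_numkb(value, header=None, kb=NUMKB):
    """Properties whose range contains value, best header match first."""
    hits = [prop for prop, (low, high) in kb.items() if low < value < high]
    if header is None:
        return hits

    def name_match(prop):
        return SequenceMatcher(None, header.lower(), prop.lower()).ratio()

    return sorted(hits, key=name_match, reverse=True)

result = query_numkb(1.98, "height")
print(result)
```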

Page 31

Graphical Model for Tables

Column header variables: C1, C2, C3

Row value variables: R11, R12, R13, R21, R22, R23, R31, R32, R33

Team [Class]: Chicago, Philadelphia, Houston, San Antonio [Instance]

Name, Vice-President, Office Held

Beetle, Red, Gasoline

Page 32

Parameterized Graphical Model

Variable nodes: column headers C1, C2, C3; row values R11, R12, R13, R21, R22, R23, R31, R32, R33

𝝍𝟑: Function that captures the affinity between the column headers and row values

𝝍𝟒: Captures interaction between row values

𝝍𝟓: Captures interaction between column headers

Page 33

Semantic Message Passing

[Figure: message passing example. Column headers C1: [BasketballPlayer], C2: [NBATeam], C3: [BasketBallPositions], connected by 𝝍𝟓. 𝝍𝟑 sends BasketballPlayer with “Change” or “No Change” to Michael_I_Jordan, Yao_Ming, Allen_Iverson, …; 𝝍𝟒 sends playsFor with “Change” or “No Change” to Michael_I_Jordan, Chicago_Bulls, …]

Page 34

Semantic Message Passing

1. [V] Send current values
2. [F] Identify Outliers
3. [F] Send semantics
4. [V] Pick new value

V – Variable Nodes
F – Factor Nodes

Semantically Aware Factor Nodes

Varish Mulwad, Tim Finin and Anupam Joshi, "Semantic Message Passing for Generating Linked Data from Tables", In 12th Int. Semantic Web Conf. (ISWC 2013), Sydney, Australia, Oct. 2013.

Page 35

𝝍𝟑 – Column Header & Row Value Agreement

𝝍𝟑 [Michael_I_Jordan, Allen_Iverson, Yao_Ming]

Name: Michael_I_Jordan, Allen_Iverson, Yao_Ming

Candidate classes: GeoPopulatedPlace, BasketBallPlayer, ArtWork

Entity classes: Athlete, BasketballPlayer, ArtificialIntelligenceResearchers

1. BasketBallPlayer 2. GeoPopulatedPlace ….

Top Class: BasketBallPlayer

topClassScore =
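A hedged sketch of what the 𝝍𝟑 factor node does with the values it receives: let the entities vote for their classes, take the most voted class as the top class, and flag entities that do not carry it as outliers. The score formula is cut off on the slide, so the share of agreeing rows used here is purely an assumption, as are the type lists.

```python
from collections import Counter

def psi3(entity_classes):
    """entity_classes: entity -> list of classes. Returns (top_class, score, messages)."""
    votes = Counter(cls for classes in entity_classes.values() for cls in classes)
    top_class, count = votes.most_common(1)[0]
    score = count / len(entity_classes)   # assumed score: share of rows that agree
    messages = {
        entity: "No-Change" if top_class in classes else "Change"
        for entity, classes in entity_classes.items()
    }
    return top_class, score, messages

# Hypothetical types for the current assignments of the Name column.
column = {
    "Michael_I_Jordan": ["ArtificialIntelligenceResearchers"],
    "Allen_Iverson": ["BasketballPlayer"],
    "Yao_Ming": ["Athlete", "BasketballPlayer"],
}
top_class, top_score, messages = psi3(column)
print(top_class, round(top_score, 2), messages)
```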

Page 36

𝝍𝟑 – Column Header & Row Value Agreement

topClassScore < threshold_class ?

If yes: Update Column Header Annotation = “No-Annotation”

Otherwise: Use the topClass in Message Passing process; send topClassScore as confidence score

Name: Michael_I_Jordan, Allen_Iverson, Yao_Ming

Messages: Change [BasketBallPlayer] / No-Change

Page 37

𝝍𝟒 – Relation between Columns

𝝍𝟒 [Michael_I_Jordan, Chicago_Bulls] [Allen_Iverson, Philadelphia_76ers] [Yao_Ming, Houston_Rockets]

Name | Team | Relation
Michael_I_Jordan | Chicago_Bulls | No – rel
Allen_Iverson | Philadelphia_76ers | playsFor
Yao_Ming | Houston_Rockets | playsFor

Candidate relations: playsFor, livesIn, ….

1. playsFor 2. livesIn ….

Top relation: playsFor

topRelScore =

Page 38

𝝍𝟒 – Relation between Columns

topRelScore < threshold_relation ?

If yes: Update Rel Annotation = “No-Annotation”

Otherwise: Use the topRel in Message Passing process; send topRelScore as confidence

Name: Michael_I_Jordan, Allen_Iverson, Yao_Ming

Messages: Change [playsFor] / No-Change

Page 39

Variable Node Update

R11: Michael Jordan

Factor nodes: 𝝍𝟑, 𝝍𝟒, 𝝍𝟒

(Team) (Chicago) (Shooting Guard)

Messages: Change [BasketBallPlayer, 0.8]; Change [playsFor, 0.6]; No-Change [0.55]

avgChangeConfidenceScore [= 0.70] > avgNoChangeConfidenceScore [0.55] ?
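The decision on this slide is a comparison of two averages. A minimal sketch using the three messages shown (the function name and the message encoding are illustrative choices):

```python
def should_update(messages):
    """messages: list of (kind, confidence), kind is "Change" or "No-Change"."""
    change = [conf for kind, conf in messages if kind == "Change"]
    no_change = [conf for kind, conf in messages if kind == "No-Change"]
    avg_change = sum(change) / len(change) if change else 0.0
    avg_no_change = sum(no_change) / len(no_change) if no_change else 0.0
    return avg_change > avg_no_change, avg_change, avg_no_change

received = [("Change", 0.8), ("Change", 0.6), ("No-Change", 0.55)]
update, avg_change, avg_no_change = should_update(received)
print(update, round(avg_change, 2), round(avg_no_change, 2))
```

Here (0.8 + 0.6) / 2 = 0.70 is larger than 0.55, so the variable node R11 picks a new value.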

Page 40

Variable Node Update

R11: Michael Jordan

(1) [Class: BasketBallPlayer, 0.8]
(2) [Relation: playsFor, 0.6]

Candidate entities: Michael_I_Jordan, …….., Michael_Jordan, ……..

Satisfy constraints: [1, 2, 3]
Satisfy constraints: [1, 2]
Satisfy constraints: [1, 3]
Satisfy constraints: [2, 3]
Satisfy constraints: [1]
Satisfy constraints: [2]
Satisfy constraints: [3]
Choose “No Annotation”
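The fallback order listed on this slide can be read as a loop over progressively weaker constraint sets. The sketch below assumes that reading; `satisfies` is a hypothetical callback, and constraint 3 is not spelled out on the slide, so it stays abstract here.

```python
CONSTRAINT_ORDER = [(1, 2, 3), (1, 2), (1, 3), (2, 3), (1,), (2,), (3,)]

def pick_entity(candidates, satisfies):
    """candidates: ranked entities; satisfies(entity) -> set of constraint ids it meets."""
    for wanted in CONSTRAINT_ORDER:
        for entity in candidates:
            if set(wanted) <= satisfies(entity):
                return entity
    return "No Annotation"

# Hypothetical check results: 1 = class BasketBallPlayer, 2 = relation playsFor.
met = {"Michael_I_Jordan": set(), "Michael_Jordan": {1, 2}}
choice = pick_entity(["Michael_I_Jordan", "Michael_Jordan"], lambda e: met[e])
print(choice)
```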

Page 41

Halting Condition

Ideal Case – No variable node receives a ‘CHANGE’ message

Practical Case – Fraction of variable nodes that receive a ‘CHANGE’ message < threshold_change
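Both halting cases fit in one loop. In this sketch `step` is a hypothetical callback that runs one round of message passing and reports how many variable nodes received a CHANGE message; the threshold value and the iteration cap are assumptions.

```python
def run_until_stable(step, num_variables, threshold_change=0.05, max_iterations=50):
    """Repeat rounds until few enough variable nodes are told to change."""
    for iteration in range(1, max_iterations + 1):
        changed = step()
        if changed == 0:                                  # ideal case
            return iteration
        if changed / num_variables < threshold_change:    # practical case
            return iteration
    return max_iterations

# Simulated per-round CHANGE counts for a table with 300 variable nodes.
counts = iter([120, 40, 9, 2])
rounds = run_until_stable(lambda: next(counts), num_variables=300)
print(rounds)
```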

Page 42

Tables Ontology

dbpedia-owl:BasketBallTeam

dbpedia:Michael_Jordan

dbpedia-owl:playsFor

Page 43

RDF Linked Data Representation

tab:cell_01 a tab:ColumnHeader;
    tab:cellLabel "Name"^^xsd:string;
    tab:columnIndex "1"^^xsd:integer;
    tab:valueType dbpedia-owl:BasketballPlayer.

tab:cell_11 a tab:DataCell;
    tab:cellLabel "Michael Jordan"^^xsd:string;
    tab:columnIndex "1"^^xsd:integer;
    tab:rowIndex "1"^^xsd:integer;
    tab:entity dbpedia:Michael_Jordan.

tab:HeaderRelation_12 a tab:TableRelation;
    tab:relFromColumn tab:cell_01;
    tab:relToColumn tab:cell_02;
    tab:relLabel dbpedia-owl:team.
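A small sketch of how such statements could be emitted from the inferred annotations. It builds Turtle strings directly rather than using an RDF library, follows the cell naming seen on the slide (row index, then column index), and uses the standard lowercase XSD datatype names (xsd:string, xsd:integer). The helper names are invented for the example.

```python
def literal(text, datatype):
    """Typed Turtle literal; escapes backslashes and double quotes."""
    escaped = str(text).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"^^xsd:{datatype}'

def header_cell(col, label, value_type):
    """Turtle for a column header cell (row 0)."""
    return "\n".join([
        f"tab:cell_0{col} a tab:ColumnHeader;",
        f"    tab:cellLabel {literal(label, 'string')};",
        f"    tab:columnIndex {literal(col, 'integer')};",
        f"    tab:valueType {value_type}.",
    ])

def data_cell(row, col, label, entity):
    """Turtle for a data cell linked to an entity."""
    return "\n".join([
        f"tab:cell_{row}{col} a tab:DataCell;",
        f"    tab:cellLabel {literal(label, 'string')};",
        f"    tab:columnIndex {literal(col, 'integer')};",
        f"    tab:rowIndex {literal(row, 'integer')};",
        f"    tab:entity {entity}.",
    ])

print(header_cell(1, "Name", "dbpedia-owl:BasketballPlayer"))
print(data_cell(1, 1, "Michael Jordan", "dbpedia:Michael_Jordan"))
```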

Page 44

Human in the loop

AAD DECODE

Query and Rank > Joint Inference > Generate RDF Linked Data > Verify (optional) > Store / Publish

Name | Team | Position | Height
Michael Jordan | Chicago | Shooting Guard | 1.98
Allen Iverson | Philadelphia | Point Guard | 1.83
Yao Ming | Houston | Center | 2.29
Tim Duncan | San Antonio | Power Forward | 2.11

Human in the loop: Before, During, After

Page 45

Human in the loop – Before

No. | Name | Team | Position | Height
1 | Michael Jordan | Chicago | Shooting Guard | 1.98
2 | Allen Iverson | Philadelphia | Point Guard | 1.83
3 | Yao Ming | Houston | Center | 2.29
4 | Tim Duncan | San Antonio | Power Forward | 2.11

Page 46

Human in the loop – Before

Team: WomenArtist, BasketBallTeam, City, PopulatedPlace, SportsTeam, ….

Michael Jordan: Michael_I_Jordan, Michael_Jordan, Michael_Jackson, Michael_Wodruff, ….

Name, Team: livesIn, team, ….

Assignments treated as “true values”

Page 47

Human in the loop – During

𝝍𝟑: Team [0.2]

𝝍𝟒: Name, Team [0.1]

WomenArtist, BasketBallTeam, City, SportsTeam, ….

Page 48

Human in the Loop – Impact on Joint Inference

𝝍𝟑 Name [BasketballPlayer]

Name: Michael_I_Jordan, Allen_Iverson, Yao_Ming

Messages: Change [BasketBallPlayer] / No-Change

R11 (Michael Jordan): [Class: BasketBallPlayer, 1.0] [Fixed] [Relation: playsFor, 0.6]

𝝍𝟒 Name, Team [playsFor]

Name: Michael_I_Jordan, Allen_Iverson, Yao_Ming

Messages: Change [playsFor] / No-Change

R11 (Michael Jordan): [Class: BasketBallPlayer, 0.8] [Relation: playsFor, 1.0] [Fixed]

Page 49

Human in the Loop – Impact on Joint Inference

R11 Chicago [Chicago_Bulls]

Candidate classes: WomenArtist, BasketBallTeam, City, PopulatedPlace, SportsTeam, ….

Candidate relations: livesIn, team, ….

Page 50

Why

How

Evaluation

Application

Wrap up

Page 51

Datasets

Dataset | # of tables used in Col. and Rel. Annotations | # of tables used in Data Cell Annotations | Average number of columns, rows
Web_Manual | 150 | 371 | 2, 36
Web_Relation | 28 | – | 4, 67
Wiki_Manual | 25 | 39 | 4, 35
Wiki_Links | – | 80 | 3, 16

Subset of the IIT-B dataset

Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. In: Proc. 36th VLDB (2010)

Page 52

Ground Truth

Human annotators marked each class, relation as ‘vital’, ‘okay’, ‘incorrect’

To compute precision, assign scores to the class & relation predicted by the system:
1 – if the class was vital
0.5 – if the class was okay, but could have been better (e.g. Place v/s City)
0 – if it was incorrect

To compute recall, assign a score of 1 if vital or okay, 0 for incorrect

Ground truth for data cell value annotations from the IIT-B dataset
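The scoring rules above translate into two small lookup tables. The denominators are not given on the slide, so this sketch assumes precision is averaged over the predictions made and recall over the number of items that have a relevant label.

```python
PRECISION_SCORE = {"vital": 1.0, "okay": 0.5, "incorrect": 0.0}
RECALL_SCORE = {"vital": 1.0, "okay": 1.0, "incorrect": 0.0}

def precision(judgements):
    """judgements: annotator labels for the predictions the system made."""
    return sum(PRECISION_SCORE[j] for j in judgements) / len(judgements)

def recall(judgements, num_expected):
    """num_expected: items that have a relevant label (assumed denominator)."""
    return sum(RECALL_SCORE[j] for j in judgements) / num_expected

labels = ["vital", "okay", "incorrect", "vital"]
p = precision(labels)
r = recall(labels, num_expected=5)
print(p, r)
```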

Page 53

Column Header Annotations

[Figure: % of relevant labels at rank 1, split into vital and okay, for Web_Manual, Web_Relation, Wiki_Manual; y-axis: percentage. Bar labels: 55.90, 36.17, 60.00, 24.31, 47.87, 18.57]

Page 54

Column Header Annotations

[Figure: % of relevant labels at different ranks (1-10) for Web_Manual, Web_Relation, Wiki_Manual; x-axis: rank; y-axis: percentage]

Page 55

Column Header Annotations

Precision, Recall and F-score at rank 1

Dataset | Precision | Recall | F-score
Web_Manual | 0.68 | 0.49 | 0.57
Web_Relation | 0.60 | 0.42 | 0.49
Wiki_Manual | 0.69 | 0.53 | 0.60
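As a quick consistency check, reading the nine bar labels as precision (0.68, 0.60, 0.69), recall (0.49, 0.42, 0.53) and F-score (0.57, 0.49, 0.60) for Web_Manual, Web_Relation and Wiki_Manual, the F-scores match the usual harmonic mean of precision and recall. That the slide uses this standard F1 definition is an assumption.

```python
def f_score(p, r):
    """Harmonic mean of precision and recall (standard F1)."""
    return 2 * p * r / (p + r) if p + r else 0.0

rank1 = {
    "Web_Manual": (0.68, 0.49),
    "Web_Relation": (0.60, 0.42),
    "Wiki_Manual": (0.69, 0.53),
}
f_scores = {name: round(f_score(p, r), 2) for name, (p, r) in rank1.items()}
print(f_scores)
```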

Page 56

Column Header Annotations

[Figure: Precision v/s Recall at ranks 1-10 for Web_Manual, Web_Relation, Wiki_Manual; x-axis: rank (k)]

Page 57

Column Header Annotations

[Figure: Semantic Message Passing v/s the rest, F-scores; bars for SMP, IIT-B, GOOG on Web_Manual and Wiki_Manual. Bar labels: 0.57, 0.6, 0.43, 0.56, 0.65, 0.67]

Page 58

Example Column Header Predictions

Column: Constituency
Predicted: N.A.
DBpedia classes [Ranks 2-10]: OfficeHolder, PrimeMinister, Politician, Election, Event, AdministrativeRegion, PopulatedPlace, University, EducationalInstitution

Column: Name of Elected M.P.
Predicted: OfficeHolder
DBpedia classes [Ranks 2-10]: Election, Event, PrimeMinister, Politician, Country, PopulatedPlace, Settlement, University, EducationalInstitution

Page 59

Relation Annotations

[Figure: % of relevant relations at rank 1, split into vital and okay, for Web_Manual_dbp, Web_Manual_yago, Web_Relation_dbp, Web_Relation_yago, Wiki_Manual_dbp, Wiki_Manual_yago; y-axis: percentage. Bar labels: 34.43, 33.33, 50.00, 50.00, 53.85, 66.67, 3.28, 6.25]

Page 60

Relation Annotations

[Figure: % of relevant relations at rank 1-10 for Web_Manual, Web_Relation, Wiki_Manual, each against DBpedia (dbp) and Yago; x-axis: rank; y-axis: percentage]

Page 61

Relation Annotations

[Figure: Semantic Message Passing v/s the rest, F-scores; bars for SMP and IIT-B on Web_Manual, Web_Relation, Wiki_Manual. Bar labels: 0.89, 0.86, 0.97, 0.51, 0.63, 0.68]

Page 62

Example Relation Predictions

Column Pair: President – Birth state
Predicted: N.A.
DBpedia rels [Ranks 2-10]: location, deathPlace, locatedInArea, birthPlace, isPartOf, largestCity, almaMater, region, state

Column Pair: Name of Elected M.P. – Party Affiliation
Predicted: party
DBpedia rels [Ranks 2-8]: affiliation, otherParty, primeMinister, deathPlace, birthPlace, region, NA

Page 63

Data Cell Value Annotations

% of correctly linked entities

Wiki_Link | 75.89
Web_Manual | 63.07
Wiki_Manual | 67.42

Page 64

How long did it run?

[Figure: Number of variables that received a “change” message at the end of an iteration; x-axis: iteration number; y-axis: number of variables that receive a “change” message; each line represents a table]

Page 65

Literals – Experimental Setup

Subset of 16 tables [17 literal value columns] from the Wiki_Link Dataset

Generate property candidate set by querying against NumKB

Manually annotated each literal column with an appropriate DBpedia property

Page 66

Header Cell Annotations for Literals

Percentage of correct properties at ranks 1-10

First chart:
Rank | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
% | 35.29 | 23.53 | 17.65 | 5.88 | 5.88 | 0.00 | 0.00 | 5.88 | 0.00 | 0.00

Second chart:
Rank | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
% | 64.71 | 23.53 | 0.00 | 0.00 | 5.88 | 0.00 | 0.00 | 5.88 | 0.00 | 0.00

Page 67

Human in the loop – Experimental Setup

Subset of 11 tables from the Wiki_Link dataset

User feedback: Correct column header class [1 column in 9 tables and 2 columns in the remaining 2 tables]

Rest of the experimental setup is the same.

Page 68

Data Cell Annotations

[Figure: Human in the Loop (HIL) v/s No Human in the Loop; x-axis: table number (1-11); y-axis: % of correctly annotated data cells; series: No HIL, HIL]

Setting | Correct Entities | Total | %
HIL | 286 | 402 | 71.14
No – HIL | 245 | 402 | 60.95

Page 69

Why

How

Evaluation

Application

Wrap up

Page 70

Interpreting Medical Tables as Linked Data for Generating Meta–Analysis Reports

Page 71

TABEL – TABle Extracted as Linked Data

AAD DECODE

Pre-processing modules [Your module here!, Normalize] > Query and Rank > Joint Inference > Generate RDF Linked Data > Verify (optional) > Store / Publish

Name | Team | Position | Height
Michael Jordan | Chicago | Shooting Guard | 1.98
Allen Iverson | Philadelphia | Point Guard | 1.83
Yao Ming | Houston | Center | 2.29
Tim Duncan | San Antonio | Power Forward | 2.11

Varish Mulwad, Tim Finin and Anupam Joshi, "Interpreting Medical Tables as Linked Data to Generate Meta–Analysis Reports", In 15th IEEE Int. Conf. on Information Reuse and Integration (IRI 2014), San Francisco, USA, Aug. 2014.

Page 72

Preprocessing – Normalize

Page 73

Preprocessing – Normalize

Header cell: Patients with Secondary Thrombosis; N = 146; no. (%)

Row: Smoker

Data cell: no. --> 49; % --> 33.6

Split header cells into Query String and Metadata

Normalize data cells; identify types or units
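A minimal sketch of the data cell normalization, assuming the raw cell looks like "49 (33.6)" under a "no. (%)" header. Both the raw cell format and the helper names are assumptions; the slide shows only the normalized result.

```python
import re

def split_units(header_metadata):
    """Split header metadata such as 'no. (%)' into unit names: ['no.', '%']."""
    return [part for part in re.split(r"[()\s]+", header_metadata) if part]

def normalize_cell(cell, header_metadata):
    """Pair each number in a cell such as '49 (33.6)' with its unit from the header."""
    units = split_units(header_metadata)
    numbers = re.findall(r"-?\d+(?:\.\d+)?", cell)
    return {
        unit: float(num) if "." in num else int(num)
        for unit, num in zip(units, numbers)
    }

normalized = normalize_cell("49 (33.6)", "no. (%)")
print(normalized)
```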

Page 74

Query – Candidate Classes* [DBpedia]

Hypertension

Query results: (1) Idiopathic intracranial hypertension (2) Pulmonary hypertension (3) Hypertension

Re-rank – Classifier (String Similarity, Popularity)

Re-ranked: (1) Hypertension (2) Pulmonary hypertension (3) Idiopathic intracranial hypertension

Also evaluated against SNOMED CT & UMLS

Page 75

Query – Candidate Classes [Hybrid]

Hypertension

(1) Hypertension (2) Pulmonary hypertension (3) Idiopathic intracranial hypertension

No results?

SNOMED CT API

(1) Hypertension (2) Pulmonary hypertension (3) Idiopathic intracranial hypertension
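The hybrid strategy reads as a simple fallback: query one source first and consult the SNOMED CT API only when nothing comes back. In this sketch both lookups are stub functions with invented contents, and treating DBpedia as the first source is an assumption based on the previous slide.

```python
def hybrid_lookup(query, primary, fallback):
    """Return primary(query) if it has results, otherwise fallback(query)."""
    results = primary(query)
    return results if results else fallback(query)

# Stub lookups standing in for the real services (hypothetical data).
def dbpedia_lookup(query):
    known = {"Hypertension": ["Hypertension", "Pulmonary hypertension"]}
    return known.get(query, [])

def snomed_lookup(query):
    known = {"Venous thrombosis": ["Venous thrombosis (disorder)"]}
    return known.get(query, [])

first = hybrid_lookup("Hypertension", dbpedia_lookup, snomed_lookup)
second = hybrid_lookup("Venous thrombosis", dbpedia_lookup, snomed_lookup)
print(first, second)
```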

Page 76

Modeling Medical Tables as RDF

PatientGroup
    numberOfIndividuals (xsd:integer): 146
    hasGroupAttribute (owl:Thing): umls:Secondary_Thrombosis

Value
    hasType (xsd:string): %
    hasRawValue (xsd:double): 33.6

Page 77

Interactive tool to generate Meta – Analysis reports

User interface to define meta-analysis parameters and select studies

Tool automatically generates relevant SPARQL queries

Page 78

Evaluation

Page 79

Header Cell Annotations

Distribution of header cell concepts at different ranks

[Figure: bar chart; series: SNOMED CT, UMLS, HYBRID, DBPEDIA; x-axis: rank of the correct concept; y-axis: percentage. Bar labels, four per rank bin:]

Rank | Percentages
1 | 29.51, 60.66, 59.02, 29.51
2-5 | 10.66, 8.2, 10.66, 20.49
6-10 | 4.1, 2.46, 4.92, 3.28
11-25 | 3.28, 2.46, 4.1, 4.92
26-101 | 18.85, 4.1, 9.02, 16.39
NF | 33.61, 22.13, 12.3, 25.41

NF: Correct concept not found in the candidate set

Dataset: 7 tables (122 header cells)

Page 80

Retrieval (Find) Evaluation Experimental Setup

• Generated Linked Data from four tables

• Executed retrieval SPARQL queries to find tables that included the correlation between venous thrombosis and four different cardiovascular risk factors

• Average Precision: 0.79; Average Recall: 0.75

Page 81

Why

How

Evaluation

Application

Wrap up

Page 82

Conclusions

I claimed:

“It is possible to generate high quality linked data from tables by jointly inferring the semantics of column headers, values (string and literal) in table cells, and relations between columns augmented with background knowledge from open data sources such as the Linked Open Data cloud.”

Page 83

Conclusions

It is possible to generate high quality linked data from tables by jointly inferring the semantics

TABEL jointly inferred the semantics; thorough evaluation showed promising results

… the semantics of column headers, values (string and literal) in table cells, and relations between columns

A novel technique to generate candidate properties from literal values

Page 84

Conclusions

It is possible to generate high quality linked data from tables

Tables ontology to represent the inferred semantics

Demonstrated domain independence and extensibility and support for tables with different structures

Explored different models for Human in the loop

Page 85

Future Work

Schema + Data driven approach

Build on the work on inferring literals; NumKB

Further develop Human in the loop

Tool to generate meta-analysis reports

Page 86

Acknowledgements

Dr. Tim Finin

Dr. Anupam Joshi

Dr. Tim Oates

Dr. Yun Peng

Dr. L V Subramaniam

Dr. Indrajit Bhattacharya

Lab mates & Friends!

Page 87

Thank You! Our papers on this research topic have garnered 93 citations!