Emir Muñoz, Fujitsu (Ireland) Limited, National University of Ireland Galway. LD4IE 2014 @ ISWC, Riva del Garda, Trentino, Italy. Oct 20th, 2014. http://bit.ly/1xYTR6Z (@emir_munoz)

Learning Content Patterns from Linked Data


Page 1: Learning Content Patterns from Linked Data

Emir Muñoz

Fujitsu (Ireland) Limited

National University of Ireland Galway

LD4IE 2014 @ ISWC, Riva del Garda, Trentino, Italy. Oct 20th, 2014

http://bit.ly/1xYTR6Z

(@emir_munoz)

Page 2: Learning Content Patterns from Linked Data


Page 3: Learning Content Patterns from Linked Data

<subject, predicate, object>

Domain(predicate) ??

Range(predicate) ??


Page 4: Learning Content Patterns from Linked Data

Let’s run the following SPARQL query over endpoint…

select distinct ?obj where {?sub <http://dbpedia.org/property/isbn> ?obj}

The endpoint response is a table with the values for the isbn property. So, what is the correct range for isbn? Some of the returned values (and some more ...):

0

71090

6176526

2

2.7073

140043853

1107020697

2940013968264

0978-02-02+02:00

http://dbpedia.org/resource/N/a

"?"@en

"ISBN 0-312-85182-0"@en

"See text"@en

"various"@en

"ISBN 978-0-465-02656-2, ISBN 0-14-017997-6"@en

"ISBN 0-553-07875-5 & ISBN 0-553-56166-9"@en

"The Claiming of Sleeping Beauty: ISBN 0-452-26656-4"@en

"-2.0"^^<http://dbpedia.org/datatype/second>

"TBA"@en

"not available"@en

"[[#Bibliography"@en
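The query above can be sent to an endpoint over HTTP: the SPARQL protocol accepts the query text in a "query" URL parameter. The following is a minimal Python sketch that only builds the request URL. The helper name build_query_url and the example.org address are hypothetical; the slide does not name the endpoint, so it is left as a parameter.

```python
from urllib.parse import urlencode


def build_query_url(endpoint, property_iri):
    """Build a GET URL asking an endpoint for the distinct objects of a property."""
    query = "select distinct ?obj where {?sub <%s> ?obj}" % property_iri
    # The SPARQL protocol carries the query text in the "query" parameter.
    return endpoint + "?" + urlencode({"query": query})


# Placeholder endpoint (an assumption); substitute the endpoint you actually query.
url = build_query_url("https://example.org/sparql",
                      "http://dbpedia.org/property/isbn")
print(url)
```

Fetching the URL and parsing the result table is left out on purpose, since the response format depends on the endpoint and the Accept header used.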

Page 5: Learning Content Patterns from Linked Data

LOV Statistics (as of July 7th, 2014):

446 vocabularies

10 classes and 20 properties on average

The range of isbn is http://schema.org/Text

Page 6: Learning Content Patterns from Linked Data

…but still, is it what I’m looking for? What is the syntax?

Page 7: Learning Content Patterns from Linked Data

Property: apoapsis

Etymology: apo- + apsis

Noun: apoapsis (plural apoapsides)

(astronomy) The point of a body's elliptical orbit about the system's centre of mass where the distance between the body and the centre of mass is at its maximum. [http://en.wiktionary.org/wiki/apoapsis]

[Figure labels: Earth, Satellite]

dbr:17049_Miron dbo:apoapsis 4.01288e+11

Page 8: Learning Content Patterns from Linked Data

https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/ontology/OntologyDatatypes.scala

Page 9: Learning Content Patterns from Linked Data

<subject, predicate, object>

Sample object values of <http://dbpedia.org/property/date>, in three columns:

1488-07-28+02:00 "September 2012"@en "--08-26+02:00"^^<http://www.w3.org/2001/XMLSchema#gMonthDay>

1982-05-23+02:00 "August 2012"@en "--01-24+02:00"^^<http://www.w3.org/2001/XMLSchema#gMonthDay>

2007-04-11+02:00 "July 2009"@en "--06-11+02:00"^^<http://www.w3.org/2001/XMLSchema#gMonthDay>

Following Lerman et al. (JAIR 2003), each column can be summarized by a pattern:

First column: [NUM-NUM-NUM+NUM:NUM] (plain literal)

Second column: [ALPHA<space>NUM] (plain literal + lang)

Third column: [--NUM-NUM+NUM:NUM] (typed literal)
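The three column patterns can be reproduced with a few lines of Python. This is a minimal sketch, not the author's implementation: the function name coarse_pattern and the tokenizing regular expression are assumptions, and the function works on the lexical form only (language tag and datatype already removed).

```python
import re

# One token is a run of letters, a run of digits, a run of whitespace,
# or any other single character (punctuation).
TOKEN_RE = re.compile(r"[^\W\d_]+|\d+|\s+|.")


def coarse_pattern(value):
    """Replace letter runs by ALPHA and digit runs by NUM; keep punctuation."""
    parts = []
    for token in TOKEN_RE.findall(value):
        if token.isdigit():
            parts.append("NUM")
        elif token.isalpha():
            parts.append("ALPHA")
        elif token.isspace():
            parts.append("<space>")
        else:
            parts.append(token)
    return "[" + "".join(parts) + "]"


print(coarse_pattern("1488-07-28+02:00"))  # [NUM-NUM-NUM+NUM:NUM]
print(coarse_pattern("September 2012"))    # [ALPHA<space>NUM]
print(coarse_pattern("--08-26+02:00"))     # [--NUM-NUM+NUM:NUM]
```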

Page 10: Learning Content Patterns from Linked Data

Values are decomposed into tokens, and each token is represented by a syntactic class, following Lerman et al. (JAIR 2003). The classes are organized from general to more specific categories.

Let … be the set of content patterns.

[Figure: an input set of values and the content patterns generated from it]
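The idea of moving from a token to increasingly general syntactic classes can be sketched in Python as follows. The class labels are borrowed from the patterns shown elsewhere in these slides, but the function name token_classes, the numeric thresholds (small_max, medium_max) and the exact assignment rules are assumptions made for illustration; the slides do not state them. Floating-point values such as 4.01288e+11 are not handled here.

```python
def token_classes(token, small_max=99, medium_max=9999):
    """Return candidate classes for one token, most specific first."""
    classes = [token]  # the literal token itself is the most specific option
    if token.isdecimal():
        number = int(token)
        if number <= small_max:          # threshold is an assumption
            classes.append("SMALL_NUMBER")
        elif number <= medium_max:       # threshold is an assumption
            classes.append("MEDIUM_NUMBER")
        else:
            classes.append("LARGE/FLOAT_NUMBER")
        classes.append("NUMBER")
    elif token.isalpha():
        if token.islower():
            classes.append("ALL_LOWERCASE")
        elif token.isupper():
            classes.append("ALL_UPPERCASE")
        classes.append("ALPHA")
    elif token.isalnum():
        classes.append("ALPHANUMERIC")
    else:
        classes.append("PUNCTUATION")
    return classes


print(token_classes("2012"))  # ['2012', 'MEDIUM_NUMBER', 'NUMBER']
print(token_classes("com"))   # ['com', 'ALL_LOWERCASE', 'ALPHA']
```

Keeping the literal token as the first candidate is what allows patterns that mix classes with constants, such as a fixed year or a fixed top-level domain.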


Page 11: Learning Content Patterns from Linked Data

2.4 billion RDF triples

53,230 properties

Version 3.9

Split

Method

19.25% plain literals

18.02% typed literals

62.73% without lang or datatype (xsd:string)
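Splitting values by literal kind, as in the breakdown above, requires separating the lexical form from its language tag or datatype. A minimal Python sketch follows, assuming the values are written as in the earlier slides (for example "TBA"@en). The function name split_term, the regular expression and the kind labels ("lang", "typed", "untagged", "iri") are illustrative choices, not the author's code, and this is not a complete RDF term parser.

```python
import re

# "lexical form" optionally followed by @lang or ^^<datatype IRI>
LITERAL_RE = re.compile(
    r'^"(?P<lex>.*)"'
    r'(?:@(?P<lang>[A-Za-z]+(?:-[A-Za-z0-9]+)*)|\^\^<(?P<dt>[^>]+)>)?$',
    re.DOTALL,
)


def split_term(term):
    """Return (kind, lexical_form, tag) for a value as printed in the slides."""
    match = LITERAL_RE.match(term)
    if match:
        if match.group("lang"):
            return ("lang", match.group("lex"), match.group("lang"))
        if match.group("dt"):
            return ("typed", match.group("lex"), match.group("dt"))
        return ("untagged", match.group("lex"), None)
    if term.startswith(("http://", "https://")):
        return ("iri", term, None)
    # Bare values such as 71090 carry neither a language tag nor a datatype.
    return ("untagged", term, None)


print(split_term('"TBA"@en'))
print(split_term('"-2.0"^^<http://dbpedia.org/datatype/second>'))
print(split_term("http://dbpedia.org/resource/N/a"))
```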


Page 12: Learning Content Patterns from Linked Data

For the apoapsis example, we extracted one pattern:

http://dbpedia.org/ontology/apoapsis LARGE/FLOAT_NUMBER 1.0

And we also found some other related properties:

http://dbpedia.org/ontology/Planet/apoapsis LARGE/FLOAT_NUMBER 1.0

http://dbpedia.org/ontology/Spacecraft/apoapsis LARGE/FLOAT_NUMBER 1.0

http://dbpedia.org/property/apoapsis NUMBER 0.9230769230769231

http://dbpedia.org/property/apoapsis LARGE/FLOAT_NUMBER 0.75213675

For the date example, we extracted 7 patterns:

http://dbpedia.org/property/date -- SMALL_NUMBER - SMALL_NUMBER 0.2

http://dbpedia.org/property/date ALPHANUMERIC MEDIUM_NUMBER 0.166

http://dbpedia.org/property/date ALPHANUMERIC 2012 0.032

http://dbpedia.org/property/date ALPHANUMERIC.ALPHANUMERIC 0.012

And more …
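The slides do not define the number attached to each pattern. One plausible reading, used here purely as an assumption, is the fraction of a property's values that match the pattern, which would explain why a general pattern (NUMBER) and a more specific one (LARGE/FLOAT_NUMBER) can both score high for the same property. A minimal Python sketch under that assumption, with a hypothetical function name and made-up input:

```python
from collections import Counter


def pattern_scores(patterns_per_value):
    """Score each pattern by the share of values that match it.

    patterns_per_value: a list with one collection of patterns per value
    (a value may match both a specific and a more general pattern).
    """
    counts = Counter()
    for patterns in patterns_per_value:
        counts.update(set(patterns))  # count each pattern once per value
    total = len(patterns_per_value)
    if total == 0:
        return {}
    return {pattern: count / total for pattern, count in counts.items()}


# Made-up observations for four values of one property.
observed = [
    {"NUMBER", "LARGE/FLOAT_NUMBER"},
    {"NUMBER", "LARGE/FLOAT_NUMBER"},
    {"NUMBER", "LARGE/FLOAT_NUMBER"},
    {"NUMBER", "SMALL_NUMBER"},
]
print(pattern_scores(observed))
```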


Page 13: Learning Content Patterns from Linked Data

The user has this value: “2014-10-20”.

Which property can they use? dbp:dateCreated, dbp:dateOfProduction, dbp:dateOpened, dbp:dateSigned, dbp:dateOfPremiere, dbp:date, among others.
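Searching for candidate properties amounts to computing the pattern of the user's value and looking it up in the pattern database. The Python sketch below illustrates the lookup only: the database contents, the scores, and the names PATTERN_DB, coarse_pattern and suggest_properties are all made up for the example and do not come from the actual data.

```python
import re


def coarse_pattern(value):
    """Replace letter runs by ALPHA and digit runs by NUM; keep the rest."""
    return re.sub(
        r"[^\W\d_]+|\d+",
        lambda m: "NUM" if m.group().isdigit() else "ALPHA",
        value,
    )


# Toy pattern database (hypothetical): property -> {pattern: score}.
PATTERN_DB = {
    "dbp:date": {"NUM-NUM-NUM": 0.4, "ALPHA NUM": 0.2},
    "dbp:dateSigned": {"NUM-NUM-NUM": 0.9},
    "dbp:isbn": {"ALPHA NUM-NUM-NUM-NUM": 0.6},
}


def suggest_properties(value, db=PATTERN_DB):
    """Return (property, score) pairs whose patterns match the value, best first."""
    pattern = coarse_pattern(value)
    hits = [(prop, pats[pattern]) for prop, pats in db.items() if pattern in pats]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


print(suggest_properties("2014-10-20"))
# [('dbp:dateSigned', 0.9), ('dbp:date', 0.4)]
```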

What is the property dbp:admCtrOf used for?

"town of republic significance of Meleuz"@en (http://dbpedia.org/resource/Meleuz)

"town of oblast significance of Oktyabrsk"@en (http://dbpedia.org/resource/Oktyabrsk)

"town of republic significance of Sortavala"@en (http://dbpedia.org/resource/Sortavala)

It is used to declare Administrative Control Of.

Page 14: Learning Content Patterns from Linked Data

Check for atypical values (outliers)

Close look into the most (in)frequent patterns

Possible errors during automatic extraction

For the dbp:isbn property we can find the following values:

"summer or autumn 380"@en
"Late November"@en
"Fall 1040"@en
680
"December, 67 BC"@en
"April-July 1799"@en
http://dbpedia.org/resource/New_Year's_Day
http://dbpedia.org/resource/Second_Intermediate_Period_of_Egypt
"New moon day of Kartika, celebrations begin two days prior and end two days after that date"@en

Are they […] or […] values?
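One way to surface such atypical values is to flag those whose pattern is rare among the values of the same property. The Python sketch below uses a few of the isbn values listed earlier in the deck. The function names, the coarse ALPHA/NUM classes and the min_share threshold are assumptions for illustration, not the author's method.

```python
import re
from collections import Counter


def coarse_pattern(value):
    """Replace letter runs by ALPHA and digit runs by NUM; keep the rest."""
    return re.sub(
        r"[^\W\d_]+|\d+",
        lambda m: "NUM" if m.group().isdigit() else "ALPHA",
        value,
    )


def rare_values(values, min_share=0.2):
    """Return the values whose pattern covers less than min_share of all values."""
    patterns = [coarse_pattern(value) for value in values]
    counts = Counter(patterns)
    total = len(values)
    return [v for v, p in zip(values, patterns) if counts[p] / total < min_share]


isbn_values = [
    "0-312-85182-0", "0-553-07875-5", "0-452-26656-4", "0-14-017997-6",
    "See text", "TBA",
]
print(rare_values(isbn_values))  # ['See text', 'TBA']
```

A flagged value is only a candidate for review: a rare pattern may be an extraction error, or simply a legitimate but unusual way of writing the value.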

Page 15: Learning Content Patterns from Linked Data

E-mail: [email protected]

Given name: John

Surname: Snow

Birthday: 1986-02-14

A vCard may be annotated with the hCard microformat (LD4IE Challenge 2014).

We can use our database to extract and validate the email:

vcard:email mailto : ALPHA PUNCTUATION ALL_LOWERCASE . ALL_LOWERCASE 0.82

vcard:email mailto : ALPHA PUNCTUATION ALL_LOWERCASE . com 0.69

vcard:email mailto : ALPHA @ ALPHANUMERIC . ALL_LOWERCASE 0.54

vcard:email mailto : ALPHA @ ALPHANUMERIC . com 0.46

vcard:email mailto : ALL_UPPERCASE ****@ ALL_LOWERCASE . ALL_LOWERCASE 0.36

…also the birthday:

vcard:bday NUMBER - SMALL_NUMBER - SMALL_NUMBER 0.5

vcard:bday MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER 0.5
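For extraction and validation, a content pattern can be translated into a regular expression. The Python sketch below shows one possible translation. The mapping from class names to regular expressions (including the digit-length ranges for SMALL_NUMBER and MEDIUM_NUMBER) is an assumption, since the slides do not define the classes precisely, and the address at example.org is made up.

```python
import re

# Assumed regular expression for each syntactic class (illustrative only).
CLASS_REGEX = {
    "ALPHA": r"[A-Za-z]+",
    "ALPHANUMERIC": r"[A-Za-z0-9]+",
    "ALL_LOWERCASE": r"[a-z]+",
    "ALL_UPPERCASE": r"[A-Z]+",
    "NUMBER": r"[0-9]+",
    "SMALL_NUMBER": r"[0-9]{1,2}",
    "MEDIUM_NUMBER": r"[0-9]{3,4}",
}


def pattern_to_regex(pattern):
    """Compile a space-separated content pattern into a regular expression."""
    parts = []
    for token in pattern.split():
        # Known class names become character classes; anything else is a constant.
        parts.append(CLASS_REGEX.get(token, re.escape(token)))
    return re.compile("".join(parts))


email_rx = pattern_to_regex("mailto : ALPHA @ ALPHANUMERIC . ALL_LOWERCASE")
bday_rx = pattern_to_regex("MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER")

print(bool(email_rx.fullmatch("mailto:john@example.org")))  # True
print(bool(bday_rx.fullmatch("1986-02-14")))                # True
print(bday_rx.search("Birthday: 1986-02-14").group())       # 1986-02-14
```

Using fullmatch validates a candidate value, while search pulls a matching substring out of surrounding text.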


Page 16: Learning Content Patterns from Linked Data

Extraction of lexico-syntactic patterns from LD datasets

500,000 content patterns

Different use cases:

Search for properties

Validation of values

Information extraction based on patterns

Future work:

Study of consistency analysis of knowledge bases

Extension of patterns to cover other knowledge bases

Among others

Page 17: Learning Content Patterns from Linked Data

http://emunoz.org

@emir_munoz

[email protected]

https://github.com/emir-munoz/ld-patterns/