Can RDB2RDF Tools Feasibly Expose Large Science Archives for Data Integration?

Alasdair J G Gray (University of Glasgow, now Manchester)
Norman Gray (Universities of Leicester and Glasgow)
Iadh Ounis (University of Glasgow)

ESWC 2009 – Crete, 3 June 2009


A.J.G. Gray - ESWC 2009 2

Outline

• Motivation: The Virtual Observatory
• Can SPARQL be used to express scientific queries?
• Can existing archives be exposed with semantic tools?
  – Can RDB2RDF tools extract large volumes of data?

International Virtual Observatory Alliance

“facilitate the international coordination and collaboration necessary for the development and deployment of the tools, systems and organizational structures necessary to enable the international utilization of astronomical archives as an integrated and interoperating virtual observatory.”


Searching for Brown Dwarfs

• Data sets:
  – Near Infrared, 2MASS/UK Infrared Deep Sky Survey
  – Optical, APMCAT/Sloan Digital Sky Survey
• Complex colour/motion selection criteria
• Similar problems
  – Halo White Dwarfs

Deep Field Surveys

• Observations in multiple wavelengths
  – Radio to X-Ray
• Searching for new objects
  – Galaxies, stars, etc.
• Requires correlations across many catalogues
  – ISO
  – Hubble
  – SCUBA
  – etc.

The Problem

Locate and combine relevant data

• Heterogeneous publishers
  – Archive centres
  – Research labs
• Heterogeneous data
  – Relational
  – XML
  – Files

Virtual Observatory

A Data Integration Approach

• Heterogeneous sources
  – Autonomous
  – Local schemas
• Homogeneous view
  – Mediated global schema
• Mapping
  – LAV: local-as-view
  – GAV: global-as-view

Relies on agreement of a common global schema

[Figure: queries Query1 … Queryn are posed against the Global Schema, which is connected through Mappings to Wrapper1 … Wrapperi … Wrapperk over DB1 … DBi … DBk]
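As a rough illustration of the global-as-view (GAV) mapping named on this slide, the sketch below defines one global relation as a view over two local sources. It is a minimal sketch under stated assumptions: the table names, column names, sample rows and the hours-to-degrees conversion are all invented for illustration and do not come from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two hypothetical autonomous sources, each with its own local schema.
conn.execute("CREATE TABLE archive_a (obj_id TEXT, ra_deg REAL, dec_deg REAL)")
conn.execute("CREATE TABLE lab_b (name TEXT, ra_hours REAL, dec_deg REAL)")
conn.execute("INSERT INTO archive_a VALUES ('a1', 150.0, 2.0)")
conn.execute("INSERT INTO lab_b VALUES ('b1', 12.0, -30.0)")

# GAV: the global relation is written as a query (a view) over the sources.
# The mapping also reconciles units: source B stores right ascension in hours.
conn.execute("""
    CREATE VIEW global_object AS
    SELECT obj_id AS id, ra_deg AS ra, dec_deg AS decl FROM archive_a
    UNION ALL
    SELECT name, ra_hours * 15.0, dec_deg FROM lab_b
""")

# A query against the mediated schema never mentions the local tables.
rows = conn.execute("SELECT id, ra FROM global_object ORDER BY id").fetchall()
print(rows)
```

Under LAV the direction would be reversed: each source would be described as a view over the global schema, which is not shown here.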

P2P Data Integration Approach

• Heterogeneous sources
  – Autonomous
  – Local schemas
• Heterogeneous views
  – Multiple schemas
• Mappings
  – From sources to common schema
  – Between pairs of schemas
• Require common integration data model

Can RDF do this?

[Figure: queries Query1 … Queryn are posed against any of several schemas (Schema1 … Schemaj), which are connected by Mappings to each other and to Wrapper1 … Wrapperi … Wrapperk over DB1 … DBi … DBk]

Resource Description Framework

• W3C standard
• Designed as a metadata data model
• Contains semantic details
• Ideal for linking distributed data
• Queried through SPARQL

[Figure: example RDF graph. #Sun (rdf:type IAU:Star, #name "The Sun") is linked by #foundIn to #MilkyWay (rdf:type IAU:BarredSpiral, #name "Milky Way" and "The Galaxy")]
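The example graph on this slide can be written down as subject, predicate, object triples. The sketch below is illustrative only: it uses plain Python tuples rather than an RDF library, the `match` helper is hypothetical, and the direction of each edge is an assumption read off the node and edge labels.

```python
# Triples read off the slide's example graph; edge directions are assumed.
TRIPLES = {
    ("#Sun", "#name", "The Sun"),
    ("#Sun", "rdf:type", "IAU:Star"),
    ("#Sun", "#foundIn", "#MilkyWay"),
    ("#MilkyWay", "#name", "Milky Way"),
    ("#MilkyWay", "#name", "The Galaxy"),
    ("#MilkyWay", "rdf:type", "IAU:BarredSpiral"),
}


def match(pattern, triples=TRIPLES):
    """Return the triples matching a (subject, predicate, object) pattern.

    None acts as a wildcard, playing the role of a SPARQL variable.
    """
    return sorted(
        t for t in triples
        if all(p is None or p == v for p, v in zip(pattern, t))
    )


# All names recorded for the Milky Way.
galaxy_names = [o for _, _, o in match(("#MilkyWay", "#name", None))]

# Follow the #foundIn link from the Sun, then look up the type of its target.
host = match(("#Sun", "#foundIn", None))[0][2]
host_types = [o for _, _, o in match((host, "rdf:type", None))]

print(galaxy_names, host, host_types)
```

Following a link and then querying its target is the pattern that makes the model attractive for relating data held in different sources.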

SPARQL

• Declarative query language
  – Select returned data
    • Graph or tuples
    • Attributes to return
  – Describe structure of desired results
  – Filter data
• W3C standard
• Syntactically similar to SQL
  – Should be easy for scientists to learn!

Integrating Using RDF

• Data resources
  – Expose schema and data as RDF
  – Need a SPARQL endpoint
• Allows multiple
  – Access models
  – Storage models
• Easy to relate data from multiple sources

[Figure: a SPARQL query is posed, via Mappings, against a Common Model (RDF), which sits above an RDF / Relational Conversion over a Relational DB and an RDF / XML Conversion over an XML DB]

We will focus on exposing relational data sources

RDB2RDF

Extract-Transform-Load
• Data replicated as RDF
  – Data can become stale
• Native SPARQL query support
  – Limited optimisation mechanisms

Existing RDF stores
• Jena
• Sesame

Query-driven Conversion
• Data stored as relations
• Native SQL query support
  – Highly optimised access methods
• SPARQL queries must be translated

Existing translation systems
• D2RQ
• SquirrelRDF
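The contrast between the two approaches can be sketched in a few lines. This is a toy illustration, not how Jena, Sesame, D2RQ or SquirrelRDF work internally: the `star` table, the `star/<id>` subject naming and both helper functions are assumptions made up for the example.

```python
import sqlite3

# Hypothetical one-table source; table and column names are illustrative.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE star (id INTEGER PRIMARY KEY, ra REAL, mag REAL)")
db.executemany("INSERT INTO star VALUES (?, ?, ?)",
               [(1, 10.5, 14.2), (2, 11.0, 17.9)])


def etl_dump(conn):
    """Extract-transform-load: copy every cell into a (s, p, o) triple."""
    triples = set()
    for row_id, ra, mag in conn.execute("SELECT id, ra, mag FROM star"):
        subject = f"star/{row_id}"
        triples.add((subject, "ra", ra))
        triples.add((subject, "mag", mag))
    return triples


def query_driven(conn, predicate, max_value):
    """Query-driven conversion: translate 'subjects whose <predicate> is
    below max_value' into SQL so the relational engine does the filtering."""
    if predicate not in ("ra", "mag"):  # whitelist before building SQL text
        raise ValueError(predicate)
    sql = f"SELECT id FROM star WHERE {predicate} < ? ORDER BY id"
    return [f"star/{row_id}" for (row_id,) in conn.execute(sql, (max_value,))]


replica = etl_dump(db)                                  # snapshot at load time
db.execute("UPDATE star SET mag = 15.0 WHERE id = 2")   # source changes later

from_replica = sorted(s for s, p, o in replica if p == "mag" and o < 16)
from_source = query_driven(db, "mag", 16)
print(from_replica, from_source)
```

The replicated triples miss the updated row, which is the staleness problem listed above; the translated query sees current data but depends on the quality of the SPARQL-to-SQL translation.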

System Hypothesis

Is it viable to perform query-driven conversions to facilitate data access from a data model that a scientist is familiar with?

Can RDB2RDF tools feasibly expose large science archives for data integration?

[Figure: integration architecture diagram. Labels: SPARQL query (twice), Mappings, Common Model (RDF), RDB2RDF over the Relational DB, RDF / XML Conversion over the XML DB]

Astronomical Test Data Set

• SuperCOSMOS Science Archive (SSA)
  – Data extracted from scans of Schmidt plates
  – Stored in a relational database
  – About 4TB of data, detailing 6.4 billion objects
  – Fairly typical of astronomical data archives
• Schema designed using 20 real queries
• Personal version contains
  – Data for a specific region of the sky
  – About 0.1% of the data
  – About 500MB

Analysis of Test Data

• Using personal version
  – About 500MB in size (similar size to related work)
• Organised in 14 Relations
  – Number of attributes: 2 – 152
    • 4 relations with more than 20 attributes
  – Number of rows: 3 – 585,560
  – Two views
    • Complex selection criteria in views

Makes this different from business cases and previous work!

Is SPARQL expressive enough?

Can the 20 sample queries be expressed in SPARQL?


Real Science Queries

Query 5: Find the positions and (B,R,I) magnitudes of all star-like objects within delta mag of 0.2 of the colours of a quasar of redshift 2.5 < z < 3.5

SQL:
  SELECT ra, dec, sCorMagB, sCorMagR2, sCorMagI
  FROM ReliableStars
  WHERE (sCorMagB - sCorMagR2 BETWEEN 0.05 AND 0.80)
    AND (sCorMagR2 - sCorMagI BETWEEN -0.17 AND 0.64)

SPARQL:
  SELECT ?ra ?decl ?sCorMagB ?sCorMagR2 ?sCorMagI
  WHERE {
    <bindings>
    FILTER (?sCorMagB - ?sCorMagR2 >= 0.05 && ?sCorMagB - ?sCorMagR2 <= 0.80)
    FILTER (?sCorMagR2 - ?sCorMagI >= -0.17 && ?sCorMagR2 - ?sCorMagI <= 0.64)
  }
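The SPARQL version replaces each BETWEEN with a pair of inclusive comparisons. The sketch below checks that the two forms select the same rows, using SQLite as a stand-in engine. The rows are made up and are not SSA data, and the declination column is named `decl` here as an assumption to sidestep keyword clashes in some SQL dialects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ReliableStars "
    "(ra REAL, decl REAL, sCorMagB REAL, sCorMagR2 REAL, sCorMagI REAL)"
)
# Made-up rows chosen to fall clearly inside or outside the colour cuts.
conn.executemany(
    "INSERT INTO ReliableStars VALUES (?, ?, ?, ?, ?)",
    [
        (10.0, -5.0, 18.0, 17.5, 17.2),  # B-R2 = 0.5, R2-I = 0.3: selected
        (11.0, -5.5, 18.0, 16.0, 15.9),  # B-R2 = 2.0: rejected
        (12.0, -6.0, 18.0, 17.9, 17.0),  # R2-I = 0.9: rejected
    ],
)

# Form used in the SQL query on the slide.
between_sql = """
    SELECT ra FROM ReliableStars
    WHERE (sCorMagB - sCorMagR2 BETWEEN 0.05 AND 0.80)
      AND (sCorMagR2 - sCorMagI BETWEEN -0.17 AND 0.64)
    ORDER BY ra
"""
# Form that mirrors the two SPARQL FILTER clauses.
expanded_sql = """
    SELECT ra FROM ReliableStars
    WHERE (sCorMagB - sCorMagR2 >= 0.05 AND sCorMagB - sCorMagR2 <= 0.80)
      AND (sCorMagR2 - sCorMagI >= -0.17 AND sCorMagR2 - sCorMagI <= 0.64)
    ORDER BY ra
"""

between_rows = conn.execute(between_sql).fetchall()
expanded_rows = conn.execute(expanded_sql).fetchall()
print(between_rows, expanded_rows)
```

The rewrite is mechanical, but each colour difference is written twice in the expanded form, which is the verbosity cost of lacking a range shorthand.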

Analysis of Test Queries

Query Feature                               Query Numbers
Arithmetic in body                          1-5, 7, 9, 12, 13, 15-20
Arithmetic in head                          7-9, 12, 13
Ordering                                    1-8, 10-17, 19, 20
Joins (including self-joins)                12-17, 19
Range functions (e.g. Between, ABS)         2, 3, 5, 8, 12, 13, 15, 17-20
Aggregate functions (including Group By)    7-9, 18
Math functions (e.g. power, log, root)      4, 9, 16
Trigonometry functions                      8, 12
Negated sub-query                           18, 20
Type casting (e.g. radians to degrees)      7, 8, 12
Server defined functions                    10, 11

Expressivity of SPARQL

Features
• Select-project-join
• Arithmetic in body
• Conjunction and disjunction
• Ordering
• String matching
• External function calls (extension mechanism)

Limitations
• Range shorthands
• Arithmetic in head
• Math functions
• Trigonometry functions
• Sub queries
• Aggregate functions
• Casting

Analysis of Test Queries


Expressible queries: 1, 2, 3, 5, 6, 14, 15, 17, 19
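The list of expressible queries can be cross-checked against the feature table. The sketch below transcribes the table and removes every query that uses a blocking feature. Which features count as blocking is an assumption made here: range shorthands are treated as rewritable with explicit comparisons, and server defined functions are treated as blocking.

```python
def expand(spec):
    """Expand a string such as '1-5, 7' into the set {1, 2, 3, 4, 5, 7}."""
    numbers = set()
    for part in spec.split(","):
        lo, _, hi = part.strip().partition("-")
        numbers.update(range(int(lo), int(hi or lo) + 1))
    return numbers


# Query numbers per feature, transcribed from the table in the slides.
FEATURES = {
    "Arithmetic in body": "1-5, 7, 9, 12, 13, 15-20",
    "Arithmetic in head": "7-9, 12, 13",
    "Ordering": "1-8, 10-17, 19, 20",
    "Joins": "12-17, 19",
    "Range functions": "2, 3, 5, 8, 12, 13, 15, 17-20",
    "Aggregate functions": "7-9, 18",
    "Math functions": "4, 9, 16",
    "Trigonometry functions": "8, 12",
    "Negated sub-query": "18, 20",
    "Type casting": "7, 8, 12",
    "Server defined functions": "10, 11",
}

# Assumption: these features make a query inexpressible in SPARQL.
BLOCKING = [
    "Arithmetic in head",
    "Aggregate functions",
    "Math functions",
    "Trigonometry functions",
    "Negated sub-query",
    "Type casting",
    "Server defined functions",
]

blocked = set().union(*(expand(FEATURES[name]) for name in BLOCKING))
expressible = sorted(set(range(1, 21)) - blocked)
print(expressible)
```

Under these assumptions the result matches the list on the slide, which suggests that the range functions in queries 2, 3, 5, 15, 17 and 19 were handled by rewriting rather than counted as a barrier.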

Can RDB2RDF tools feasibly expose large science archives for data integration?


Experiment

• Time query evaluation
  – 5 out of 20 queries used
  – No joins
• Systems compared:
  – Relational DB (Base line)
    • MySQL v5.1.25
  – RDB2RDF tools
    • D2RQ v0.5.2
    • SquirrelRDF v0.1
  – RDF Triple stores
    • Jena v2.5.6 (SDB)
    • Sesame v2.1.3 (Native)

[Figure: the three configurations compared. A SPARQL query sent through RDB2RDF to a Relational DB; a SPARQL query sent to a Triple store; an SQL query sent directly to a Relational DB]

Experimental Configuration

• 8 identical machines
  – 64 bit Intel Quad Core Xeon 2.4GHz
  – 4GB RAM
  – 100 GB Hard drive
  – Java 1.6
  – Linux
• 10 repetitions
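A timing loop with repetitions might look like the sketch below. It is only an illustration of the measurement pattern: the slides do not say how the 10 repetitions were aggregated or whether caches were warmed, SQLite stands in for the systems actually tested, and the `time_query` helper and the sample table are hypothetical.

```python
import sqlite3
import statistics
import time


def time_query(conn, sql, repetitions=10):
    """Run sql `repetitions` times; return elapsed wall-clock times in ms."""
    timings_ms = []
    for _ in range(repetitions):
        start = time.perf_counter()
        conn.execute(sql).fetchall()  # fetch all rows so retrieval is timed too
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


# Small made-up table so the example runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x REAL)")
conn.executemany("INSERT INTO t VALUES (?)", [(i * 0.5,) for i in range(1000)])

timings = time_query(conn, "SELECT x FROM t WHERE x >= 10 AND x <= 20 ORDER BY x")
print(f"mean {statistics.mean(timings):.3f} ms over {len(timings)} runs")
```

Reporting the spread alongside the mean (for example with `statistics.stdev`) helps show whether the first, cold run dominates the result.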

Performance Results

[Figure: bar chart of query evaluation time in ms (axis from 0 to 1000) for Queries 1, 2, 3, 5 and 6 on MySQL, D2RQ, SquirrelRDF (SqRDF), Jena and Sesame. Bars that exceed the axis carry value labels: 3,450; 5,339; 21,492; 485,932; 2,733; 7,229; 4,090; 1,307; 17,793; 7,468; 19,984; 372,561 (label-to-bar assignment not recoverable)]

Conclusions

• SPARQL not expressive enough for real astronomical queries
• RDBMS benefits from 30+ years research
  – Query optimisation
  – Indexes
• RDF stores are improving
  – Require existing data to be replicated
• RDB2RDF tools show promise
  – Need to exploit relational database

Can RDB2RDF Tools Feasibly Expose Large Science Archives for Data Integration?

Not currently!

We need more work on query translation…
