1
Distributed Storage and Querying Techniques for a Semantic Web of Scientific Workflow Provenance
The ProvBase System
Artem Chebotko (joint work with John Abraham, Pearl Brazier, Jaime Navarro, and Anthony Piazza)
Department of Computer Science
University of Texas – Pan American
[email protected]
http://www.cs.panam.edu/~artem
July 8th, 2010
2
Background
Semantic Web: Web of Data
Machine-processable semantic data - metadata that describes resources and relationships among them
Semantic Web standards: RDF, RDFS, OWL, SPARQL
Scientific Workflows & Provenance: Powerful paradigm for formalizing and automating complex and data-intensive scientific processes
In-silico experiments, e-science
Provenance: metadata that captures the origin and derivation history of data products
Scientific discovery reproducibility, result interpretation, and problem diagnosis primarily depend on provenance
Semantic Web of Scientific Workflow Provenance: Semantic Web technologies, scientific workflow provenance, interoperability, and integration
3
Motivation, Goals, Challenges
Scientific workflows generate a lot of provenance. A scientific workflow can be executed numerous times with different settings, parameters, and inputs to obtain interesting results
The TangoInSilico workflow designed in VIEW has over 20 different parameters and can generate around 500 RDF triples every 3 seconds. That is >14 million triples per day!
Provenance from different projects can be integrated: the Open Provenance Model (http://openprovenance.org) and the Third Provenance Challenge (http://twiki.ipaw.info)
There exists a growing need for efficient database systems that employ distributed storage and querying techniques to cope with large-scale provenance data management. Most existing solutions assume a single-machine deployment
4
Motivation, Goals, Challenges
Shared-disk and shared-nothing clustering
Google’s Bigtable is an eye-catcher! Caches most data on the Web; 16+ billion webpages according to http://www.worldwidewebsize.com
Has open-source implementation HBase (http://hbase.apache.org); HBase builds on top of Hadoop (http://hadoop.apache.org)
Java framework that supports intensive data communication among computers in a cluster
Capable of connecting and coordinating thousands of nodes inside a cluster
Distributes data to obtain the best performance
A scalable, distributed database that supports structured data storage for large tables
Not a relational database; "a sparse, distributed multi-dimensional sorted map"
5
Motivation, Goals, Challenges
Goal: Store and query scientific workflow provenance (in RDF) using HBase
Challenges:
Data partitioning in HBase is based on row keys; RDF triples have no keys.
A table cell can contain a set of values with different timestamps. A relationship between two values from different cell sets in different columns but for the same row key is not captured.
No high-level, declarative query language like SQL; a simple API instead.
Querying is based on row keys.
What database schema is suitable for storing RDF triples to efficiently support triple pattern matching?
How can SPARQL queries be evaluated against an HBase database?
6
Contributions
Architect the ProvBase system that incorporates an HBase/Hadoop backend for distributed storage and querying of provenance triples
Design a three-table storage schema that can be instantiated in HBase to hold provenance triples
Explore querying algorithms to evaluate SPARQL queries in HBase using its native API
Conduct an experimental study using the Third Provenance Challenge queries
7
Organization of This Talk
Related Work
ProvBase Architecture
Storage Schema
Querying Algorithms
Performance Study
Concluding Remarks & Future Work
8
Related Work: Scientific Workflow Provenance Data Management
9
Related Work: Distributed RDF Data Management
Heart (Highly Extensible & Accumulative RDF Table); http://rdf-proj.blogspot.com and http://heart.korea.ac.kr
SPIDER (http://dbserver.korea.ac.kr/projects/spider/)
M. F. Husain, P. Doshi, L. Khan, and B. M. Thuraisingham, “Storage and retrieval of large RDF graph using Hadoop and MapReduce,” in Proc. of CLOUD, 2009, pp. 680–686.
M. F. Husain, L. Khan, M. Kantarcioglu, and B. M. Thuraisingham, “Data Intensive Query Processing for Large RDF Graphs Using Cloud Computing Tools,” in Proc. of CLOUD, 2010.
J. Urbani, S. Kotoulas, E. Oren, F. van Harmelen, “Scalable Distributed Reasoning Using MapReduce,” in Proc. of ISWC, 2009, pp. 634-649.
RDFCube and RDFPeers
More related work in the paper
10
ProvBase Architecture
[Architecture diagram: clients connect through provenance collection and querying middleware to ProvBase servers, which front an HBaseMaster coordinating multiple HRegionServers.]
Clients collect and query provenance
ProvBase servers process all client requests
An active master coordinates an HBase cluster
Region servers store provenance
11
Storage Schema
An HBase table has rows and columns. A row is uniquely identified by a row key. A table cell can contain a set of values. A cell value has a timestamp.
Figure courtesy of Google, Inc.
[Figure: an HBase row consists of a row key and a cell set; each cell holds column data paired with a timestamp.]
12
Storage Schema
Sample RDF triples:
<D> <generatedArtifact> <A> .
<D> <generatedByProcess> <P> .
<C> <usedByProcess> <P> .
When stored in an HBase database, each triple should be searchable by a subject, predicate and object.
Triple pattern <D> ?p ?o matches two triples
Triple pattern ?s <usedByProcess> ?o matches one triple
Triple pattern ?s ?p <P> matches two triples
Other variations exist, including ?s ?p ?o, which matches all the triples
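The matching behavior above can be sketched in a few lines of Python (an illustrative sketch, not the authors' Java implementation; variables are marked by a leading `?` as in SPARQL):

```python
# Sample triples from the slide, represented as (subject, predicate, object) tuples.
TRIPLES = [
    ("<D>", "<generatedArtifact>",  "<A>"),
    ("<D>", "<generatedByProcess>", "<P>"),
    ("<C>", "<usedByProcess>",      "<P>"),
]

def is_var(term):
    # SPARQL-style variables begin with "?"
    return term.startswith("?")

def match_all(pattern):
    # A variable matches any term; a URI/literal must match itself.
    return [t for t in TRIPLES
            if all(is_var(p) or p == v for p, v in zip(pattern, t))]

# <D> ?p ?o matches two triples; ?s <usedByProcess> ?o matches one;
# ?s ?p <P> matches two; ?s ?p ?o matches all three.
```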
13
Storage Schema
Sample RDF triples:
<D> <generatedArtifact> <A> .
<D> <generatedByProcess> <P> .
<C> <usedByProcess> <P> .
Three-table schema:
Ts – search by subject
Tp – search by predicate
To – search by object
Other considerations:
Possible to combine Ts, Tp, and To into one table
Tables that allow searching by both subject and object, subject and predicate, and so forth
Row key hashing
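A hypothetical in-memory rendering of the three-table schema (Python dictionaries stand in for the HBase tables; `table` is an illustrative helper, not part of ProvBase):

```python
# Each table stores the same triples but keys its rows by a different
# position, so any bound position of a triple pattern maps to a row lookup.
triples = [
    ("<D>", "<generatedArtifact>",  "<A>"),
    ("<D>", "<generatedByProcess>", "<P>"),
    ("<C>", "<usedByProcess>",      "<P>"),
]

def table(keyed_by):
    # keyed_by: 0 = subject (Ts), 1 = predicate (Tp), 2 = object (To)
    rows = {}
    for t in triples:
        rows.setdefault(t[keyed_by], []).append(t)
    return rows

Ts, Tp, To = table(0), table(1), table(2)
# Ts has rows "<D>" and "<C>"; To's row "<P>" holds two triples.
```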
14
Querying Algorithms
Three algorithms in the paper:
matchTP-T – matching a triple pattern over a triple
matchTP-DB – matching a triple pattern over a database
matchBGP-DB – matching a basic graph pattern over a database
matchTP-T checks that three conditions are satisfied: (1) a variable can match anything, (2) a URI or literal must match itself, and (3) a variable that occurs more than once must match the same term for all occurrences.
Returns true or false
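A minimal Python sketch of the three matchTP-T conditions (the function name mirrors the slide; the signature and data representation are assumptions, not the paper's code):

```python
def match_tp_t(pattern, triple):
    """Return True iff the triple pattern matches the triple."""
    bindings = {}
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # (1) a variable matches anything, but
            # (3) repeated occurrences must bind to the same term
            if p in bindings and bindings[p] != t:
                return False
            bindings[p] = t
        elif p != t:
            # (2) a URI or literal must match itself
            return False
    return True
```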
15
Querying Algorithms
matchTP-DB distinguishes several cases:
If the subject pattern is not a variable, retrieve a row from Ts with the corresponding key
Otherwise, if object pattern is not a variable, retrieve a row from To with the corresponding key
Otherwise, if predicate pattern is not a variable, retrieve a row from Tp with the corresponding key
Otherwise, retrieve all rows from Ts, Tp or To
Each retrieved row contains one or more triples, each of which is tested against the input triple pattern using matchTP-T
matchTP-DB returns a set of matching triples
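The case analysis can be sketched as follows (Python dictionaries stand in for the Ts/Tp/To HBase tables; the simplified final check ignores repeated-variable bindings, which the full matchTP-T handles):

```python
from collections import defaultdict

def build_tables(triples):
    # In-memory stand-ins for Ts/Tp/To: row key -> triples stored in that row.
    ts, tp, to = defaultdict(list), defaultdict(list), defaultdict(list)
    for s, p, o in triples:
        ts[s].append((s, p, o))
        tp[p].append((s, p, o))
        to[o].append((s, p, o))
    return ts, tp, to

def match_tp_db(pattern, ts, tp, to):
    s, p, o = pattern
    if not s.startswith("?"):
        rows = ts.get(s, [])                        # keyed lookup in Ts
    elif not o.startswith("?"):
        rows = to.get(o, [])                        # keyed lookup in To
    elif not p.startswith("?"):
        rows = tp.get(p, [])                        # keyed lookup in Tp
    else:
        rows = [t for r in ts.values() for t in r]  # full scan
    # Each candidate triple is still verified against the pattern.
    return [t for t in rows
            if all(x.startswith("?") or x == v for x, v in zip(pattern, t))]
```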
16
Querying Algorithms
matchBGP-DB steps:
Sort triple patterns in non-descending order of their selectivities, such that triple patterns that yield smaller results appear first in the list:
• (1) a triple pattern that contains only variables has the highest selectivity,
• (2) a triple pattern with a non-variable only at the predicate pattern position has moderate selectivity, and
• (3) a triple pattern with a non-variable at the subject and/or object pattern positions has low selectivity
Evaluate each triple pattern using matchTP-DB to obtain matching triple sets
Join resulting sets using a nested-loops-like join strategy
• N triple patterns in a basic graph pattern require N loops nested inside of each other
• Intermediate results are not materialized; all the joins are performed concurrently – if a triple from the first set joins with a triple from the second set, attempt to join the result with a triple from the third set, and so forth
• Joining conditions require that triples must agree on the values (bindings) of shared variables
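The steps above can be sketched compactly in Python (assumed representation: triples as tuples, variables prefixed with `?`; the recursion plays the role of the N nested loops):

```python
def rank(pattern):
    # Evaluation order per the slide: subject/object-bound patterns first
    # (smallest expected results), predicate-only next, all-variable last.
    s, p, o = pattern
    if not s.startswith("?") or not o.startswith("?"):
        return 0
    if not p.startswith("?"):
        return 1
    return 2

def match_bgp_db(patterns, triples):
    def match_one(pattern):
        return [t for t in triples
                if all(x.startswith("?") or x == v
                       for x, v in zip(pattern, t))]

    ordered = sorted(patterns, key=rank)
    results = []

    def join(i, bindings):          # one recursion level per triple pattern
        if i == len(ordered):
            results.append(bindings)
            return
        for triple in match_one(ordered[i]):
            new = dict(bindings)
            consistent = True
            for term, value in zip(ordered[i], triple):
                if term.startswith("?"):
                    if new.get(term, value) != value:
                        consistent = False   # shared variables must agree
                        break
                    new[term] = value
            if consistent:
                join(i + 1, new)

    join(0, {})
    return results
```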
17
Querying Algorithms
Other SPARQL features:
Projection (SELECT)
• Performed during the triple pattern or basic graph pattern matching phase
Filtering (FILTER)
• Logic connectives, inequality and equality operators, unary predicates, etc.
• Performed during the triple pattern or basic graph pattern matching phase
Alternative graph patterns (UNION)
• Easy if triple-sets are union-compatible • Need to extend “schemas” otherwise
Optional graph patterns (OPTIONAL)
• Most complicated
• Nested optional and parallel optional constructs
• Top-down and bottom-up evaluation approaches
18
Performance Study
Algorithms were implemented in Java
Cluster setup: 5 nodes – 1 master/ProvBase server and 4 region servers
Gateway E3600 computers, 1.8 GHz Pentium 4 processor, 1 GB RAM, IDE hard drives with 16+ GB of free space, gigabit Ethernet adapter.
D-Link DGS-2208 gigabit switch
Debian 5.0.3 operating system, OpenJDK 1.6.0, Hadoop 0.20.1, HBase 0.20.3
Datasets and queries:
Load Workflow from the 3rd Provenance Challenge
Three queries from the 3rd Provenance Challenge
Tupelo’s OWL vocabulary
Each workflow run generated ~700 RDF triples
19
Performance Study
Datasets
20
Performance Study
Queries
21
Performance Study
Data ingest performance
22
Performance Study
Query performance
23
Performance Study
Query optimization – triple patterns from Q2:
• ?table opm:generatedArtifact p:tableID
− Tp returns ~3 million triples for the largest dataset
− To returns 1 triple
• ?table opm:generatedByProcess ?process
− Tp returns millions of triples
− Ts can be used only if the binding of ?table is known from the previous triple pattern, such that ?table opm:generatedByProcess ?process becomes p:table1 opm:generatedByProcess ?process
− Ts returns few triples
Performance of Q2 and Q3 can be substantially improved using this variable substitution technique
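The substitution idea can be illustrated with toy data (the triples below are invented for illustration, not the Challenge dataset): evaluate the selective pattern first, then substitute its ?table bindings into the second pattern so it becomes a keyed Ts lookup rather than a Tp scan.

```python
# Hypothetical provenance triples (illustrative only).
triples = [
    ("p:table1", "opm:generatedArtifact",  "p:tableID"),
    ("p:table1", "opm:generatedByProcess", "p:proc1"),
    ("p:table2", "opm:generatedByProcess", "p:proc2"),
]

def match(pattern):
    return [t for t in triples
            if all(p.startswith("?") or p == v for p, v in zip(pattern, t))]

# Step 1: the selective pattern binds ?table (To would return 1 triple).
tables = [s for s, _, _ in match(("?table", "opm:generatedArtifact", "p:tableID"))]

# Step 2: substitute each binding, turning the second pattern into
# keyed lookups instead of a scan of all generatedByProcess triples.
results = [t for tab in tables
           for t in match((tab, "opm:generatedByProcess", "?process"))]
```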
24
Concluding Remarks & Future Work
Provenance of 100,000 workflow executions was efficiently stored and queried on a small cluster of commodity machines. Very cost-effective.
Future Work:
Optional graph patterns
Distributing workload among ProvBase servers
Row key hashing
Encoding multiple triple pattern terms or even graph patterns as row keys
Experimental comparison with existing relational and native RDF stores
Experimental comparison with a relational RDF store deployed on a MySQL cluster
Inference
Region size optimizations
25
Acknowledgement
We would like to thank David Kirtley, the Software Systems Specialist for the Department of Computer Science at UTPA, for his assistance with various technical issues that occurred during this research.
26
Thank You!
Questions?
Artem Chebotko
Department of Computer Science
University of Texas – Pan American
[email protected]
http://www.cs.panam.edu/~artem