1
Distributed Storage and Querying Techniques for a Semantic Web of Scientific Workflow Provenance
The ProvBase System
Artem Chebotko (joint work with John Abraham, Pearl Brazier, Jaime Navarro, and Anthony Piazza)
Department of Computer Science
University of Texas – Pan American
[email protected]
http://www.cs.panam.edu/~artem
July 8th, 2010
2
Background
Semantic Web: Web of Data
Machine-processable semantic data - metadata that describes resources and relationships among them
Semantic Web standards: RDF, RDFS, OWL, SPARQL
Scientific Workflows & Provenance: Powerful paradigm for formalizing and automating complex and data-intensive scientific processes
In-silico experiments, e-science
Provenance: metadata that captures the origin and derivation history of data products
Scientific discovery reproducibility, result interpretation, and problem diagnosis primarily depend on provenance
Semantic Web of Scientific Workflow Provenance: Semantic Web technologies, scientific workflow provenance, interoperability, and integration
3
Motivation, Goals, Challenges
Scientific workflows generate a lot of provenance. A scientific workflow can be executed numerous times with different settings, parameters, and inputs to obtain interesting results
The TangoInSilico workflow designed in VIEW has over 20 different parameters and can generate around 500 RDF triples every 3 seconds. That is >14 million triples per day!
Provenance from different projects can be integrated: the Open Provenance Model (http://openprovenance.org) and the Third Provenance Challenge (http://twiki.ipaw.info)
There exists a growing need for efficient database systems that employ distributed storage and querying techniques to cope with large-scale provenance data management. Most existing solutions assume a single-machine deployment
4
Motivation, Goals, Challenges
Shared-disk and shared-nothing clustering
Google’s Bigtable is an eye-catcher! Caches most data on the Web; 16+ billion webpages according to http://www.worldwidewebsize.com
Has open-source implementation HBase (http://hbase.apache.org); HBase builds on top of Hadoop (http://hadoop.apache.org)
Java framework that supports intensive data communication among computers in a cluster
Capable of connecting and coordinating thousands of nodes inside a cluster
Distributes data to obtain the best performance
A scalable, distributed database that supports structured data storage for large tables
Not a relational database; "a sparse, distributed multi-dimensional sorted map"
5
Motivation, Goals, Challenges
Goal: Store and query scientific workflow provenance (in RDF) using HBase
Challenges:
Data partitioning in HBase is based on row keys; RDF triples have no keys.
A table cell can contain a set of values with different timestamps. A relationship between two values from different cell sets in different columns but for the same row key is not captured.
No high-level, declarative query language like SQL; a simple API instead.
Querying is based on row keys.
What database schema is suitable for storing RDF triples to efficiently support triple pattern matching?
How can SPARQL queries be evaluated against an HBase database?
6
Contributions
Architect the ProvBase system that incorporates an HBase/Hadoop backend for distributed storage and querying of provenance triples
Design a three-table storage schema that can be instantiated in HBase to hold provenance triples
Explore querying algorithms to evaluate SPARQL queries in HBase using its native API
Conduct an experimental study using the Third Provenance Challenge queries
7
Organization of This Talk
Related Work
ProvBase Architecture
Storage Schema
Querying Algorithms
Performance Study
Concluding Remarks & Future Work
8
Related Work: Scientific Workflow Provenance Data Management
9
Related Work: Distributed RDF Data Management
Heart (Highly Extensible & Accumulative RDF Table); http://rdf-proj.blogspot.com and http://heart.korea.ac.kr
SPIDER (http://dbserver.korea.ac.kr/projects/spider/)
M. F. Husain, P. Doshi, L. Khan, and B. M. Thuraisingham, “Storage and retrieval of large RDF graph using Hadoop and MapReduce,” in Proc. of CLOUD, 2009, pp. 680–686.
M. F. Husain, L. Khan, M. Kantarcioglu, and B. M. Thuraisingham, “Data Intensive Query Processing for Large RDF Graphs Using Cloud Computing Tools,” in Proc. of CLOUD, 2010.
J. Urbani, S. Kotoulas, E. Oren, F. van Harmelen, “Scalable Distributed Reasoning Using MapReduce,” in Proc. of ISWC, 2009, pp. 634-649.
RDFCube and RDFPeers
More related work in the paper
10
ProvBase Architecture
[Architecture diagram: clients connect through provenance collection and querying middleware to ProvBase servers, which front an HBaseMaster coordinating multiple HRegionServers.]
Clients collect and query provenance
ProvBase servers process all client requests
An active master coordinates an HBase cluster
Region servers store provenance
11
Storage Schema
An HBase table has rows and columns. A row is uniquely identified by a row key. A table cell can contain a set of values. A cell value has a timestamp.
Figure courtesy of Google, Inc.
[Figure: an HBase row consists of a row key and a cell set; each cell holds column data paired with a timestamp.]
12
Storage Schema
Sample RDF triples:
<D> <generatedArtifact> <A> .
<D> <generatedByProcess> <P> .
<C> <usedByProcess> <P> .
When stored in an HBase database, each triple should be searchable by a subject, predicate and object.
Triple pattern <D> ?p ?o matches two triples
Triple pattern ?s <usedByProcess> ?o matches one triple
Triple pattern ?s ?p <P> matches two triples
Other variations exist, including ?s ?p ?o, which matches all the triples
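The matching behavior above can be sketched in a few lines of Python (an illustrative sketch, not the authors' Java implementation; variables are marked by a leading `?` as in SPARQL):

```python
# Sample triples from the slide, represented as (subject, predicate, object) tuples.
TRIPLES = [
    ("<D>", "<generatedArtifact>",  "<A>"),
    ("<D>", "<generatedByProcess>", "<P>"),
    ("<C>", "<usedByProcess>",      "<P>"),
]

def is_var(term):
    # SPARQL-style variables begin with "?"
    return term.startswith("?")

def match_all(pattern):
    # A variable matches any term; a URI/literal must match itself.
    return [t for t in TRIPLES
            if all(is_var(p) or p == v for p, v in zip(pattern, t))]

# <D> ?p ?o matches two triples; ?s <usedByProcess> ?o matches one;
# ?s ?p <P> matches two; ?s ?p ?o matches all three.
```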
13
Storage Schema
Sample RDF triples:
<D> <generatedArtifact> <A> .
<D> <generatedByProcess> <P> .
<C> <usedByProcess> <P> .
Three-table schema:
Ts – search by subject
Tp – search by predicate
To – search by object
Other considerations:
Possible to combine Ts, Tp, and To into one table
Tables that allow searching by both subject and object, subject and predicate, and so forth
Row key hashing
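A hypothetical in-memory rendering of the three-table schema (Python dictionaries stand in for the HBase tables; `table` is an illustrative helper, not part of ProvBase):

```python
# Each table stores the same triples but keys its rows by a different
# position, so any bound position of a triple pattern maps to a row lookup.
triples = [
    ("<D>", "<generatedArtifact>",  "<A>"),
    ("<D>", "<generatedByProcess>", "<P>"),
    ("<C>", "<usedByProcess>",      "<P>"),
]

def table(keyed_by):
    # keyed_by: 0 = subject (Ts), 1 = predicate (Tp), 2 = object (To)
    rows = {}
    for t in triples:
        rows.setdefault(t[keyed_by], []).append(t)
    return rows

Ts, Tp, To = table(0), table(1), table(2)
# Ts has rows "<D>" and "<C>"; To's row "<P>" holds two triples.
```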
14
Querying Algorithms
Three algorithms in the paper:
matchTP-T – matching a triple pattern over a triple
matchTP-DB – matching a triple pattern over a database
matchBGP-DB – matching a basic graph pattern over a database
matchTP-T checks that three conditions are satisfied: (1) a variable can match anything, (2) a URI or literal must match itself, and (3) a variable that occurs more than once must match the same term for all occurrences.
Returns true or false
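A minimal Python sketch of the three matchTP-T conditions (the function name mirrors the slide; the signature and data representation are assumptions, not the paper's code):

```python
def match_tp_t(pattern, triple):
    """Return True iff the triple pattern matches the triple."""
    bindings = {}
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # (1) a variable matches anything, but
            # (3) repeated occurrences must bind to the same term
            if p in bindings and bindings[p] != t:
                return False
            bindings[p] = t
        elif p != t:
            # (2) a URI or literal must match itself
            return False
    return True
```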
15
Querying Algorithms
matchTP-DB distinguishes several cases:
If the subject pattern is not a variable, retrieve a row from Ts with the corresponding key
Otherwise, if object pattern is not a variable, retrieve a row from To with the corresponding key
Otherwise, if predicate pattern is not a variable, retrieve a row from Tp with the corresponding key
Otherwise, retrieve all rows from Ts, Tp or To
Each retrieved row contains one or more triples, each of which is tested against the input triple pattern using matchTP-T
matchTP-DB returns a set of matching triples
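The case analysis can be sketched as follows (Python dictionaries stand in for the Ts/Tp/To HBase tables; the simplified final check ignores repeated-variable bindings, which the full matchTP-T handles):

```python
from collections import defaultdict

def build_tables(triples):
    # In-memory stand-ins for Ts/Tp/To: row key -> triples stored in that row.
    ts, tp, to = defaultdict(list), defaultdict(list), defaultdict(list)
    for s, p, o in triples:
        ts[s].append((s, p, o))
        tp[p].append((s, p, o))
        to[o].append((s, p, o))
    return ts, tp, to

def match_tp_db(pattern, ts, tp, to):
    s, p, o = pattern
    if not s.startswith("?"):
        rows = ts.get(s, [])                        # keyed lookup in Ts
    elif not o.startswith("?"):
        rows = to.get(o, [])                        # keyed lookup in To
    elif not p.startswith("?"):
        rows = tp.get(p, [])                        # keyed lookup in Tp
    else:
        rows = [t for r in ts.values() for t in r]  # full scan
    # Each candidate triple is still verified against the pattern.
    return [t for t in rows
            if all(x.startswith("?") or x == v for x, v in zip(pattern, t))]
```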
16
Querying Algorithms
matchBGP-DB steps:
Sort triple patterns in non-descending order of their selectivities, such that triple patterns that yield smaller results appear first in the list:
• (1) a triple pattern that contains only variables has the highest selectivity,
• (2) a triple pattern with a non-variable only at the predicate pattern position has moderate selectivity, and
• (3) a triple pattern with a non-variable at the subject and/or object pattern positions has low selectivity
Evaluate each triple pattern using matchTP-DB to obtain matching triple sets
Join resulting sets using a nested-loops-like join strategy
• N triple patterns in a basic graph pattern require N loops nested inside of each other
• Intermediate results are not materialized; all the joins are performed concurrently – if a triple from the first set joins with a triple from the second set, attempt to join the result with a triple from the third set, and so forth
• Joining conditions require that triples must agree on the values (bindings) of shared variables
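The steps above can be sketched compactly in Python (assumed representation: triples as tuples, variables prefixed with `?`; the recursion plays the role of the N nested loops):

```python
def rank(pattern):
    # Evaluation order per the slide: subject/object-bound patterns first
    # (smallest expected results), predicate-only next, all-variable last.
    s, p, o = pattern
    if not s.startswith("?") or not o.startswith("?"):
        return 0
    if not p.startswith("?"):
        return 1
    return 2

def match_bgp_db(patterns, triples):
    def match_one(pattern):
        return [t for t in triples
                if all(x.startswith("?") or x == v
                       for x, v in zip(pattern, t))]

    ordered = sorted(patterns, key=rank)
    results = []

    def join(i, bindings):          # one recursion level per triple pattern
        if i == len(ordered):
            results.append(bindings)
            return
        for triple in match_one(ordered[i]):
            new = dict(bindings)
            consistent = True
            for term, value in zip(ordered[i], triple):
                if term.startswith("?"):
                    if new.get(term, value) != value:
                        consistent = False   # shared variables must agree
                        break
                    new[term] = value
            if consistent:
                join(i + 1, new)

    join(0, {})
    return results
```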
17
Querying Algorithms
Other SPARQL features:
Projection (SELECT)
• Performed during the triple pattern or basic graph pattern matching phase
Filtering (FILTER)
• Logic connectives, inequality and equality operators, unary predicates, etc.
• Performed during the triple pattern or basic graph pattern matching phase
Alternative graph patterns (UNION)
• Easy if triple-sets are union-compatible • Need to extend “schemas” otherwise
Optional graph patterns (OPTIONAL)
• Most complicated
• Nested optional and parallel optional constructs
• Top-down and bottom-up evaluation approaches
18
Performance Study
Algorithms were implemented in Java
Cluster setup: 5 nodes – 1 master/ProvBase server and 4 region servers
Gateway E3600 computers, 1.8 GHz Pentium 4 processor, 1 GB RAM, IDE hard drives with 16+ GB of free space, gigabit Ethernet adapter.
D-Link DGS-2208 gigabit switch
Debian 5.0.3 operating system, OpenJDK 1.6.0, Hadoop 0.20.1, HBase 0.20.3
Datasets and queries:
Load Workflow from the 3rd Provenance Challenge
Three queries from the 3rd Provenance Challenge
Tupelo’s OWL vocabulary
Each workflow run generated ~700 RDF triples
19
Performance Study
Datasets
20
Performance Study
Queries
21
Performance Study
Data ingest performance
22
Performance Study
Query performance
23
Performance Study
Query optimization – triple patterns from Q2:
• ?table opm:generatedArtifact p:tableID
− Tp returns ~3 million triples for the largest dataset
− To returns 1 triple
• ?table opm:generatedByProcess ?process
− Tp returns millions of triples
− Ts can be used only if the binding of ?table is known from the previous triple pattern, such that ?table opm:generatedByProcess ?process becomes p:table1 opm:generatedByProcess ?process
− Ts returns few triples
Performance of Q2 and Q3 can be substantially improved using this variable substitution technique
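The substitution idea can be illustrated with toy data (the triples below are invented for illustration, not the Challenge dataset): evaluate the selective pattern first, then substitute its ?table bindings into the second pattern so it becomes a keyed Ts lookup rather than a Tp scan.

```python
# Hypothetical provenance triples (illustrative only).
triples = [
    ("p:table1", "opm:generatedArtifact",  "p:tableID"),
    ("p:table1", "opm:generatedByProcess", "p:proc1"),
    ("p:table2", "opm:generatedByProcess", "p:proc2"),
]

def match(pattern):
    return [t for t in triples
            if all(p.startswith("?") or p == v for p, v in zip(pattern, t))]

# Step 1: the selective pattern binds ?table (To would return 1 triple).
tables = [s for s, _, _ in match(("?table", "opm:generatedArtifact", "p:tableID"))]

# Step 2: substitute each binding, turning the second pattern into
# keyed lookups instead of a scan of all generatedByProcess triples.
results = [t for tab in tables
           for t in match((tab, "opm:generatedByProcess", "?process"))]
```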
24
Concluding Remarks & Future Work
Provenance of 100,000 workflow executions was efficiently stored and queried on a small cluster of commodity machines. Very cost-effective.
Future Work:
Optional graph patterns
Distributing workload among ProvBase servers
Row key hashing
Encoding multiple triple pattern terms or even graph patterns as row keys
Experimental comparison with existing relational and native RDF stores
Experimental comparison with a relational RDF store deployed on a MySQL cluster
Inference
Region size optimizations
25
Acknowledgement
We would like to thank David Kirtley, the Software Systems Specialist for the Department of Computer Science at UTPA, for his assistance with various technical issues that occurred during this research.
26
Thank You!
Questions?
Artem Chebotko
Department of Computer Science
University of Texas – Pan American
[email protected]
http://www.cs.panam.edu/~artem