Upload
accumulo-summit
View
394
Download
10
Tags:
Embed Size (px)
Citation preview
Rya: Optimizations to Support Real Time Graph Queries on AccumuloDr. Caleb Meier, Puja Valiyil, Aaron Mihalik, Dr. Adina Crainiceanu
DISTRIBUTION STATEMENT A. Approved for
public release; distribution is unlimited.
ONR Case Number 43-279-15 JB.01.2015
22
Acknowledgements
This work is the collective effort of:
Parsons’ Rya Team, sponsored by the Department of
the Navy, Office of Naval Research
Rya Founders: Roshan Punnoose, Adina Crainiceanu,
and David Rapp
33
Overview
Rya Overview
Query Execution within Rya
Query Optimizations
Results
Summary
44
Background: Rya and RDF
Rya: Resource Description Framework (RDF)
Triplestore built on top of Accumulo
RDF: W3C standard for representing
linked/graph data
Represents data as statements (assertions) about
resources
– Serialized as triples in {subject, predicate, object}
form
– Example:
• {Caleb, worksAt, Parsons}
• {Caleb, livesIn, Virginia}
Caleb
ParsonsVirginia
worksAtlivesIn
55
Background: SPARQL
RDF Queries are described using SPARQL
SPARQL Protocol and RDF Query Language
SQL-like syntax for finding triples matching
specific patterns
Look for subgraphs that match triple statement patterns
Joins are performed when there are variables common
to two or more statement patterns
SELECT ?people WHERE {
?people <worksAt> <Parsons>.
?people <livesIn> <Virginia>.
}
66
Rya Architecture
Open RDF Interface for interacting with RDF data
stored on Accumulo
Open RDF (Sesame): Open
Source Java framework for
storing and querying RDF
data
Open RDF Provides several
interfaces/abstractions
central for interacting with
a RDF datastore
– SAIL interface for interacting with underlying persisted
RDF model
– SAIL: Storage And Inference Layer
Data storage layer
Query processing in SAIL layer
SPARQL
Rya Open RDF
Rya QueryPlanner
Accumulo
77
Storage: Triple Table Index
3 Tables
SPO : subject, predicate, object
POS : predicate, object, subject
OSP : object, subject, predicate
Store triples in the RowID of the table
Store graph name in the Column Family
Advantages:
Native lexicographical sorting of row keys fast range queries
All patterns can be translated into a scan of one of these tables
88
Overview
Rya Overview
Query Execution within Rya
Query Optimizations
Results
Summary
99
…
worksAt, Netflix, Dan
worksAt, OfficeMax, Zack
worksAt, Parsons, Bob
worksAt, Parsons, Greta
worksAt, Parsons, John
…
Rya Query Execution
Implemented OpenRDF Sesame SAIL API
Parse queries, generate initial query plan, execute plan
Triple patterns map to range queries in Accumulo
SELECT ?x WHERE { ?x <worksAt> <Parsons>.
?x <livesIn> <Virginia>. }
Step 1: POS Table – scan range
…
Bob, livesIn, Georgia
…
Greta, livesIn, Virginia
…
John, livesIn, Virginia
…
Step 2: for each ?x, SPO – index lookup
1010
More Complicated Example of Rya Query Execution
Step 2: For each ?x,
SPO Table lookup
…
Greta, commuteMethod,
bike
…
John, commuteMethod,
Bus
…
Step 3: For each
remaining ?x, SPO
Table lookup
Step 1: POS Table – scan
range for worksAt, Parsons
?x livesIn Virginia?x worksAt Parsons
?x commuteMethod bike
…
worksAt, Netflix, Dan
worksAt, Parsons, Bob
worksAt, Parsons, Greta
worksAt, Parsons, John
worksAt, PlayStation,
Alice
…
…
Bob, livesIn, Georgia
…
Greta, livesIn, Virginia
…
John, livesIn, Virginia
…
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <livesIn> Virginia.
?x <commuteMethod> bike.
}
1111
Challenges in Query Execution
Scalability and Responsiveness
Massive amounts of data
Potentially large amounts of comparisons
Consider the Previous Example:
Default query execution: comparing each “?x” returned from first
statement pattern query to all subsequent triple patterns
There are 8.3 million Virginia residents, about 15,000 Parsons
employees, and 750,000 people who commute via bike.
Only 100 people who work at Parsons commute via bike while 1000
people who work at Parsons live in Virginia.
Poor query execution plans can result in simple queries
taking minutes as opposed to milliseconds
SELECT ?x WHERE {
?x <livesIn> Virginia.
?x <worksAt> Parsons.
?x <commuteMethod> bike.
}
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <livesIn> Virginia.
?x <commuteMethod> bike.
}
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?x <livesIn> Virginia.
}
vs. vs.
1212
Overview
Rya Overview
Query Execution within Rya
Query Optimizations
Results
Summary
1313
Rya Query Optimizations
Goal: Optimize query execution (joins) to better support real time responsiveness
Three Approaches:
Reduce the number of joins: Pattern Based Indices
– Pre-calculate common joins
Limit data in joins: Use more stats to improve query planning
– Cardinality estimation on individual statement patterns
– Join selectivity estimation on pairs of statement patterns
Make joins more efficient: Distribute the Join Processing
– Distribute processing using SPARK SQL or MapReduce
– Use Hash Joins and Intersecting Iterators
– Just beginning to start looking at this
1414
Rya Query Optimizations Using Cardinalities
Goal: Optimize ordering of query execution to
reduce the number of comparison operations
Order execution based on the number of triples that
match each triple pattern
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?x <livesIn> Virginia.
}
8.3M matches
15k matches
750k matches
1515
Rya Cardinality Usage
Maintain cardinalities on the following triple patterns
element combinations: Single elements: Subject, Predicate, Object
Composite elements: Subject-Predicate, Subject-Object,
Predicate-Object
Computed periodically using MapReduce Row ID:
– <CardinalityType><TripleElements>
• OBJECT, Parsons
• PREDICATEOBJECT, worksAt, Parsons
Cardinality stored in the value
Sparse table: Only store cardinalities above a threshold
Only need to recompute cardinalities if the
distribution of the data changes significantly
1616
Limitations of Cardinality Approach
Consider a more complicated query
Cardinality approach does not take into account number of results returned by joins
Solution lies in estimating the “join selectivity” for a each pair of triples
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?vehicle <vehicleType> SUV.
?x <livesIn> Virginia.
?x <owns> ?vehicle.
}
2.1M matches
15k matches
750k matches
8.3M matches
254M matches
1717
Rya Query Optimizations Using Join Selectivity
Query optimized using
only Cardinality Info:
Query optimized using Cardinality
and Join Selectivity Info:
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?vehicle <vehicleType> SUV.
?x <livesIn> Virginia.
?x <owns> ?vehicle.
}
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?x <livesIn> Virginia.
?x <owns> ?vehicle.
?vehicle <vehicleType> SUV.
}
Join Selectivity measures number of results returned by joining two
triple patterns Approach taken from: RDF-3X: a RISC-style Engine for RDF by Thomas
Neumann and Gerhard Weikum in JDMR (formerly Proc. VLDB) 2008
Due to computational complexity, estimate of join selectivity for triple
patterns is pre-computed and stored in Accumulo
Join selectivity estimated by computing the number of results obtained
when each triple pattern is joined with the full table
1818
Join Selectivity: General Algorithm
For statement patterns <?x, p1, o1> and <?x, p2, o2> with ?x a
variable and p1, o1 , p2, o2 constant, estimate the number of results
Sel(<?x, p1, o1> <?x, ?y, ?z>) and Sel(<?x, p2, o2> <?x, ?y, ?z>)
give number of results returned by joining a statement pattern with
the full table along the subject component
Full table join statistics precomputed and stored in index
Join statistics for each triple pattern computed using following equation:
Use analogous definition if variables appear in predicate or object position
Join selectivity statistics used with cardinalities to generate more
efficient query plans
1919
Join Selectivity: Integration into Rya
Join Selectivity estimates used to optimize Rya queries
through a greedy algorithm approach
Query constructed starting with first triple pattern to be
evaluated (the pattern with the smallest cardinality) and then
patterns are added based on minimization of a cost function
Cost function
C = leftCard + rightCard + leftCard*rightCard*selectivity
C measures number of entries Accumulo must scan and the
number of comparisons required to perform the join
Selectivity set to one if two triple patterns share no common
variables, otherwise precomputed estimates used
Ensures that patterns with common variables are grouped
together
2020
Construction of Selectivity Tables
For the pattern <?x, p1, o1>, associate each RDF triple of
the form <c, p1, o1> with the cardinality |<c,?y,?z>| and
then sum the results
Given a triple <c, p1, o1> in the SPO table, Map Job 1 emits
the key-value pair (c, (p1, o1))
Map Job 2 processes the cardinality table and emits the key
value pair (c, |<c,?y,?x>|), which consists of the constant c
and its single component, subject cardinality for the table
Map Job 3 merges the results from jobs 1 and 2 by emitting
the key-value pair ((p1, o1), |<c,?y,?x>|)
Map Job 4 sums the cardinalities from those key-value pairs
containing (p1, o1) as a key, and the result is written to the
selectivity table
2121
Query Optimizations Using Pre-Computed Joins
Reduce joins by pre-computing common joins
Approach taken from: Heese, Ralf, et al. "Index Support for
SPARQL." European Semantic Web Conference, Innsbruck,
Austria. 2007.
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?x <livesIn> Virginia.
?x <owns> ?vehicle.
?vehicle <vehicleType> SUV.
}
Pre-compute using
batch processing
and look up during
query execution
2222
Query Optimizations Using Pre-Computed Joins
Index Result Table
.…
Aaron, ToyotaRav4
Caleb, JeepCherokee
Puja, HondaCRV
.…
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?x <livesIn> Virginia.
?x <owns> ?vehicle.
?vehicle <vehicleType> SUV.
}
SELECT ?person ?car
WHERE {
?person <livesIn> Virginia.
?person <owns> ?car.
?car <vehicleType> SUV.
}
1. Pre-compute a portion of the query
using MapReduce
2. Store SPARQL describing the query
along with pre-computed values in
Accumulo
3. Normalize query variables to match
stored SPARQL variables during
query execution
Stored SPARQL
2323
Overview
Rya Overview
Query Execution within Rya
Query Optimizations
Results
Summary
2424
Query Optimization Results
Ran 14 queries against the Lehigh University Benchmark (LUBM)
dataset (33.34 million triples) LUBM queries 2, 5, 9, and 13 were discarded after 3 runs due to query complexity
– Remaining queries were executed 12 times
Cluster Specs:
– 8 worker nodes, each has 2 x 6-Core Xeon E5-2440 (2.4GHz) Processors and
48 GB RAM
Results indicate that cardinality and join selectivity optimizations provide
improved or comparable performance
2525
Summary
Cardinality estimation and join selectivity can
improve query response times for ad hoc queries
Effects of join selectivity are more apparent for
complex queries over large datasets
Pre-computed joins are extremely useful for
optimizing common queries
Potentially avoid large number of join operations
Maintaining pre-computed join indices is difficult
2626
Questions?
2727
BACK-UP
2828
Useful Links
SPARQL http://www.w3.org/TR/rdf-sparql-query/
http://jena.apache.org/tutorials/sparql.html
RDF http://www.w3.org/TR/2014/NOTE-rdf11-primer-20140225/
Rya https://github.com/LAS-NCSU/rya
– Source on github: Provides documentation and sample client code
– Email Aaron Mihalik ([email protected]) for access (US Citizens only)
Rya Working Group
– Monthly telecon / update on progress, issues, upcoming features
– Email Puja Valiyil [email protected] to join (US Citizens only)
Open RDF Tutorial: http://openrdf.callimachus.net/sesame/tutorials/getting-
started.docbook?view
Open RDF Javadoc: http://openrdf.callimachus.net/sesame/2.7/apidocs/index.html
Punnoose R., Crainiceanu A., Rapp D. 2012. Rya: a scalable RDF triple store for the
clouds. Proceedings of the 1st International Workshop on Cloud Intelligence.
http://www.usna.edu/Users/cs/adina/research/Rya_CloudI2012.pdf
Roshan Punnoose, Adina Crainiceanu, David Rapp. SPARQL in the Clouds Using Rya.
Information Systems Journal (2013).
http://www.usna.edu/Users/cs/adina/research/Rya_ISjournal2013.pdf
2929
Next Steps
Maintaining pre-computed join indices
Dynamically determining potential pre-computed
joins
Distributing query planning and execution
SPARK SQL
Rya backed by other datastores
Fully open sourcing Rya
3030
Sample LUBM Queries (1 of 3)
Query 1
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:GraduateStudent .
?X ub:takesCourse <http://www.Department0.University0.edu/GraduateCourse0>}
}
Query 3
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:Publication .
?X ub:publicationAuthor <http://www.Department0.University0.edu/AssistantProfessor0>}
}
3131
Sample LUBM Queries (2 of 3)
Query 7
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X ?Y WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:Student .
?Y rdf:type ub:Course .
?X ub:takesCourse ?Y .
<http://www.Department0.University0.edu/AssociateProfessor0> ub:teacherOf ?Y}
}
Query 8
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X ?Y ?Z WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:Student .
?Y rdf:type ub:Department .
?X ub:memberOf ?Y .
?Y ub:subOrganizationOf <http://www.University0.edu> .
?X ub:emailAddress ?Z}
}
3232
Sample LUBM Queries (3 of 3)
Query 9
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X ?Y ?Z WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:Student .
?Y rdf:type ub:Faculty .
?Z rdf:type ub:Course .
?X ub:advisor ?Y .
?Y ub:teacherOf ?Z .
?X ub:takesCourse ?Z}
}
Query 11
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:ResearchGroup .
?X ub:subOrganizationOf <http://www.University0.edu>}
}