Scan-Sharing for Optimizing RDF Graph Pattern Matching on MapReduce


DESCRIPTION

Scan-Sharing for Optimizing RDF Graph Pattern Matching on MapReduce. HyeongSik Kim, Padmashree Ravindra, Kemafor Anyanwu ({hkim22, pravind2, kogan}@ncsu.edu). COUL: Semantic COmpUting research Lab. Outline: Background, RDF Graph Pattern Matching.

Transcript


Scan-Sharing for Optimizing RDF Graph Pattern Matching on MapReduce

HyeongSik Kim, Padmashree Ravindra, Kemafor Anyanwu
{hkim22, pravind2, kogan}@ncsu.edu

COUL Semantic COmpUting research Lab

Outline

  • Background
  • RDF Graph Pattern Matching
  • Graph Pattern Matching on MapReduce
  • Queries with Repeated Properties (QRP)
  • Nested Triplegroup Algebra (NTGA)
  • Challenges: Processing QRP with NTGA
  • Approach: TripleGroup Cloning
  • Well-formed, Ambiguous, and Perfect TripleGroups
  • TripleGroup Cloning in TG_GroupFilter
  • Evaluation
  • Related Work

Let me begin with the background section, where I will explain what is needed to understand my research topics: the Semantic Web, its data model RDF, MapReduce, and so on.

The Growing Amount of RDF data

  • May 2007: 12 datasets
  • Sep 2011: 295 datasets
  • Growing number of RDF triples: currently 31 billion
  • The amount of RDF on the web is rapidly growing.
  • Example: DBpedia (http://dbpedia.org)
      A dataset extracted from Wikipedia.
      Contains 1 billion RDF triples.

Linked Data on the web: Now the issue is that the amount of RDF on the web is growing rapidly. For example, there is a dataset called DBpedia, which consists of information extracted from Wikipedia and contains around 1 billion RDF triples. The number of triples published on the web was relatively small around 5 years ago, but now 31 billion triples have been released. Therefore, scalable RDF query processing techniques are required to process those RDF triples.

RDF Data Model (Resource Description Framework)

  • How is knowledge represented in the Semantic Web? E.g., information on mobile device products.
  • The Resource Description Framework (RDF) is used: the W3C standard data model for the Semantic Web.
  • Information is represented in the form of a triple. Ex.: "product1 has a name called iphone4" as RDF:
      A subject: product1
      A property: name
      An object: iphone4
      (:Product1, :name, :iphone4)
  • The data model is a directed labeled graph.
      Node: subject, object
      Labeled edge: property

[Figure: example RDF graph. Node labels: :Producer1, :Product1, :Product2, iphone4, iphone5, www.apple.com, $499. Edge labels: :name, :design, :homepage, :price, :date.]

The Semantic Web is an extension of the current web that provides a common framework allowing data to be shared and reused across application, enterprise, and community boundaries. The Semantic Web stack builds on the Resource Description Framework, or RDF, which is a generic data model for data interchange on the Web. RDF represents information on the web using triples of the form (subject, property, object). For example, assume that we want to represent a description of a mobile device product such as "Product1 hasColor White" in RDF. One way to do this is to decompose the statement into a subject "the product1", a property "has the color", and an object "white". After that, each element of the RDF triple is replaced with a URI that serves as a unique ID on the web, such as http://example.com/Product1 for the subject of this triple.
Statements in RDF can also be modeled with the nodes and arcs of a traditional graph representation. For example, this statement can be illustrated as a node for the subject, a node for the object, and an arc for the property, directed from the subject node to the object node.
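To make the triple-as-graph view concrete, here is a minimal sketch that is not part of the slides. It stores triples as plain Python tuples and builds a subject-to-edges map; the function name build_graph and the choice of a dictionary are assumptions made only for illustration, and the values come from the deck's mobile device example.

```python
from collections import defaultdict

# Each triple is (subject, property, object), as in the slide's example.
triples = [
    (":Product1", ":name", "iphone4"),
    (":Product1", ":price", "$499"),
    (":Product2", ":name", "iphone5"),
]


def build_graph(triples):
    """Map each subject node to its outgoing labeled edges (property, object)."""
    graph = defaultdict(list)
    for subject, prop, obj in triples:
        # The property is the edge label; the edge points from subject to object.
        graph[subject].append((prop, obj))
    return dict(graph)


graph = build_graph(triples)
for subject, edges in graph.items():
    for prop, obj in edges:
        print(f"{subject} --{prop}--> {obj}")
```

Each printed line corresponds to one arc of the directed labeled graph, from the subject node to the object node.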

Processing RDF Query (from the Viewpoint of Graph Pattern Matching)

1. Example RDF Dataset: RDF graph on mobile devices (oval: resources in the Web; rectangle: literals)

[Figure: example RDF graph. Node labels: :Producer1, :Product1, :Product2, iphone4, iphone5, 2011-10-14, 2012-09-12, www.apple.com, $499. Edge labels: :name, :design, :date, :homepage, :price.]

2. Example RDF Query:

    SELECT * WHERE {
      ?product :name ?productName .
      ?product :price ?productPrice .
      ?product :date ?productDate .
    }

  • A query variable is denoted with a question mark (e.g., ?product).
  • The graph pattern consists of three triple patterns.
  • It is a star pattern whose subject variable is ?product.

Now let's see how we can run a query against an RDF dataset. Processing an RDF query is essentially a graph pattern matching process against graph data. This is an example in the RDF query language: we specify the graph pattern to be matched inside the WHERE clause. The graph pattern consists of a set of triple patterns, which are RDF triples that may contain query variables at the subject, property, or object position. For example, every triple pattern in this query has the variable ?product, denoted by a question mark, in the subject position. Processing this graph pattern means finding the triples in the dataset that match these triple patterns, and the result of the query is a set of variable bindings that represents a matching subgraph of the queried RDF graph.

Let me show you how this example query can be processed from the viewpoint of graph pattern matching. This is the example dataset on mobile device products, and we want to know the name of each product together with its price and production date from this RDF graph. Assume that we process the first triple pattern first: we look for the triples whose property is :name, and there are two triples satisfying this constraint. Next, we look for triples whose subject equals a subject in the previous matching result and whose property is :price, because the first and second triple patterns share the subject variable ?product. As you can see, there is no triple whose subject is :Product2 and whose property is :price, so the previous match on :Product2 is discarded. However, there is a triple whose subject is :Product1 and whose property is :price, so the subgraph whose subject is :Product1 remains a valid result. The third triple pattern is processed in the same way as the second one, so this red subgraph is the final matching result. Because this subgraph looks like a star, we call a graph pattern whose triple patterns share a common subject a star pattern.
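The walk-through above can be sketched in a few lines of Python. This is an illustration only, not the authors' implementation: the function name match_star is hypothetical, and the sketch assumes each subject has at most one value per property, which holds for the example dataset but not for RDF in general.

```python
# Example dataset from the slides, as (subject, property, object) tuples.
triples = [
    (":Product1", ":price", "$499"),
    (":Product1", ":name", "iphone4"),
    (":Product1", ":date", "2011-10-14"),
    (":Product2", ":name", "iphone5"),
    (":Product2", ":date", "2012-09-12"),
]


def match_star(triples, properties):
    """Return {subject: {property: object}} for subjects that have every
    property of the star pattern (assumption: one value per property)."""
    wanted = set(properties)
    bindings = {}
    for subject, prop, obj in triples:
        if prop in wanted:
            bindings.setdefault(subject, {})[prop] = obj
    # A subject missing any property of the pattern is discarded,
    # just as :Product2 is discarded for lacking :price.
    return {s: b for s, b in bindings.items() if set(b) == wanted}


result = match_star(triples, [":name", ":price", ":date"])
print(result)
```

Only :Product1 survives, since :Product2 has no :price triple.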

Processing RDF Query (based on Relational Algebra)

1. Example RDF Dataset (Relation R):

    Subject     Property   Object
    :Product1   :price     $499
    :Product1   :name      iphone 4
    :Product1   :date      2011-10-14
    :Product2   :name      iphone 5
    :Product2   :date      2012-09-12

2. Example RDF Query (implicit joins on ?product):

    SELECT * WHERE {
      ?product :name ?productName .
      ?product :price ?productPrice .
      ?product :date ?productDate .
    }

3. Conceptual Execution Plan:

    First scan of relation R
    Second scan of relation R, joined with the first on (Subject = Subject)
    Third scan of relation R, joined with the previous result on (Subject = Subject)

4. (Intermediate) Result:

    After the first scan:  (:Product1, :name, iphone 4)
                           (:Product2, :name, iphone 5)
    After the first join:  (:Product1, :name, iphone 4, :Product1, :price, $499)
    After the second join: (:Product1, :name, iphone 4, :Product1, :price, $499, :Product1, :date, 2011-10-14)

Now let's see how this example query can be processed using relational algebra. We store the RDF triples in a single relation whose schema is (subject, property, object), and this is the conceptual execution plan. The first triple pattern is translated into a selection of the rows whose property is :name using a full scan, and this is the intermediate result. Next, the second triple pattern is translated into a selection of the triples whose property is :price. The tuples selected by those two selections must be joined on the subject field because the triple patterns share a common subject variable, and this is the intermediate result.
The third triple pattern is processed in the same way as the second one, and this is the final result. In summary, processing an RDF graph pattern requires multiple full scans of the dataset along with multiple expensive self-join operations.
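As a rough sketch of this conceptual plan (not taken from the slides), the following Python code runs three full scans of relation R and two self-joins on the subject field. The helper names scan and join_on_subject are hypothetical, and the nested-loop join is chosen only for brevity.

```python
# Relation R with schema (subject, property, object), from the slide.
R = [
    (":Product1", ":price", "$499"),
    (":Product1", ":name", "iphone 4"),
    (":Product1", ":date", "2011-10-14"),
    (":Product2", ":name", "iphone 5"),
    (":Product2", ":date", "2012-09-12"),
]


def scan(relation, prop):
    """Full scan of the relation, selecting rows whose property equals prop."""
    return [row for row in relation if row[1] == prop]


def join_on_subject(left, right):
    """Nested-loop join on (Subject = Subject); field 0 of every tuple is
    the subject of its first triple."""
    return [l + r for l in left for r in right if l[0] == r[0]]


names = scan(R, ":name")    # first scan of R
prices = scan(R, ":price")  # second scan of R
dates = scan(R, ":date")    # third scan of R

step1 = join_on_subject(names, prices)  # first self-join
final = join_on_subject(step1, dates)   # second self-join
print(final)
```

Every triple pattern costs one more pass over R and one more join, which is the overhead the summary sentence above points to.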

Overview of MapReduce

[Figure: MapReduce data flow. Labels: HDFS, Disk, sort, merge.]

  • MapReduce (MR): a large-scale data processing system running on a cluster of machines. [DEAN04]
  • Tasks are encoded in low-level code as map/reduce functions, which are executed in parallel across the cluster.
      1. Map(k1, v1) → list(k2, v2)
      2. Reduce(k2, list(v2)) → list(v3)
    [NYKIEL10]

MapReduce is a large-scale data processing system running on a cluster of machines. Users just encode their tasks in low-level code as map/reduce functions, which are executed in parallel across the cluster. The Map function is applied in parallel to every pair in the input dataset, and each call of the map instances m1, m2, and m3 produces a list of pairs; therefore, the cost of Map functio