Scan-Sharing for Optimizing RDF Graph Pattern Matching on MapReduce


DESCRIPTION

Scan-Sharing for Optimizing RDF Graph Pattern Matching on MapReduce. HyeongSik Kim, Padmashree Ravindra, Kemafor Anyanwu, {hkim22, pravind2, kogan}@ncsu.edu. COUL (Semantic COmpUting research Lab). Outline: background on RDF graph pattern matching and MapReduce, queries with repeated properties, the Nested TripleGroup Algebra, TripleGroup cloning, evaluation, and related work.

Transcript

Scan-Sharing for Optimizing RDF Graph Pattern Matching on MapReduce
HyeongSik Kim, Padmashree Ravindra, Kemafor Anyanwu
{hkim22, pravind2, kogan}@ncsu.edu
COUL (Semantic COmpUting research Lab)

Outline
- Background: RDF graph pattern matching; graph pattern matching on MapReduce; queries with repeated properties (QRP); Nested TripleGroup Algebra (NTGA)
- Challenges: processing QRP with NTGA
- Approach: TripleGroup cloning (well-formed, ambiguous, and perfect TripleGroups; TripleGroup cloning in TG_GroupFilter)
- Evaluation
- Related work

Let me begin with the background section, where I will cover the background needed to understand this work, such as the Semantic Web, its data model RDF, and MapReduce.

The Growing Amount of RDF Data
- Linked Data on the web: May 2007, 12 datasets; Sep 2011, 295 datasets.
- Growing number of RDF triples: currently 31 billion.
- Example: DBpedia (http://dbpedia.org), a dataset extracted from Wikipedia, contains about 1 billion RDF triples.

The amount of RDF on the web is growing rapidly. For example, DBpedia consists of information extracted from Wikipedia and contains around 1 billion RDF triples. The number of triples published on the web was relatively small around five years ago, but there are now 31 billion triples released on the web. Scalable RDF query processing techniques are therefore required to process them.

RDF Data Model (Resource Description Framework)
- How is knowledge represented in the Semantic Web, e.g., information on mobile device products? The Resource Description Framework (RDF), the W3C standard data model for the Semantic Web, is used.
- Information is represented as triples. For example, "Product1 has the name iphone4" becomes a subject :Product1, a property :name, and an object :iphone4, i.e., (:Product1, :name, :iphone4).
- The data model is a directed labeled graph: subjects and objects are nodes; properties are labeled edges.

[Figure: RDF graph of mobile device products, e.g., :Producer1 connected by :design edges to :Product1 (with :name iphone4, :price $499, :date) and :Product2 (with :name iphone5, :date), plus a :homepage edge to www.apple.com.]

The Semantic Web is an extension of the current web that provides a common framework allowing data to be shared and reused across application, enterprise, and community boundaries. The Semantic Web stack builds on RDF, a generic data model for data interchange on the web. RDF represents information as triples of the form (subject, property, object). For example, to represent a statement about mobile device products such as "Product1 hasColor White" in RDF, we decompose it into the subject :Product1, the property :hasColor, and the object "White"; each element is then replaced with a URI serving as a unique ID on the web, such as http://example.com/Product1 for the subject. RDF statements can also be modeled with the nodes and arcs of traditional graph representation: a node for the subject, a node for the object, and an arc for the property, directed from the subject node to the object node.

Processing an RDF Query (from the Viewpoint of Graph Pattern Matching)
- A query variable is denoted with a question mark (e.g., ?product).
- Example RDF query, a graph pattern consisting of three triple patterns:

    SELECT * WHERE {
      ?product :name  ?productName .
      ?product :price ?productPrice .
      ?product :date  ?productDate .
    }
- Example RDF dataset: an RDF graph on mobile devices (ovals: resources on the web; rectangles: literals), with :Product1 (:name iphone4, :price $499, :date 2011-10-14) and :Product2 (:name iphone5, :date 2012-09-12), both reached by :design edges from :Producer1, which has :homepage www.apple.com.
- A star pattern: a graph pattern whose triple patterns share a common subject variable, here ?product.

Now let us see how we can run a query against an RDF dataset. Processing an RDF query is essentially a graph pattern matching process against graph data. In this example of an RDF query language, we specify a graph pattern to be matched inside the WHERE clause; the graph pattern consists of a set of triple patterns, which are RDF triples that may contain query variables in the subject, property, or object position. Here, every triple pattern contains the variable ?product, denoted by a question mark, in the subject field. Processing this graph pattern means finding triples that match these triple patterns in the dataset; the result is a set of variable bindings representing a matching subgraph in the queried RDF graph. Walking through the example: suppose we want the name of each product along with its price and production date. Processing the first triple pattern, whose property is :name, yields two matching triples. For the second pattern we then look for triples whose subject equals a subject from the previous matching result and whose property is :price, because the first and second triple patterns share the common subject variable ?product. There is no triple whose subject is :Product2 and property is :price, so the previous match on :Product2 is discarded; there is such a triple for :Product1, so the subgraph rooted at :Product1 remains a valid result. The third triple pattern is processed the same way, so the subgraph around :Product1 is the final matching result. Because this matching subgraph looks like a star, we call a graph pattern sharing a common subject a star pattern.
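To make the matching semantics concrete, here is a minimal sketch (not from the slides; the data and the iterative candidate-pruning mirror the walkthrough above) of star-pattern matching over an in-memory list of triples:

    # Minimal sketch: star-pattern matching by iterative pruning of subjects.
    triples = [
        (":Product1", ":name",  "iphone4"),
        (":Product1", ":price", "$499"),
        (":Product1", ":date",  "2011-10-14"),
        (":Product2", ":name",  "iphone5"),
        (":Product2", ":date",  "2012-09-12"),
    ]

    star_pattern = [":name", ":price", ":date"]  # properties bound to ?product

    def match_star(triples, properties):
        # Start with subjects matching the first triple pattern, then drop
        # every subject that lacks a triple for some remaining property.
        candidates = {s for (s, p, o) in triples if p == properties[0]}
        for prop in properties[1:]:
            candidates &= {s for (s, p, o) in triples if p == prop}
        # Return the matching star subgraph for each surviving subject.
        return {s: [(s2, p, o) for (s2, p, o) in triples
                    if s2 == s and p in properties]
                for s in candidates}

    print(match_star(triples, star_pattern))  # only :Product1 matches

As in the walkthrough, :Product2 is pruned because it has no :price triple, and the binding for :Product1 carries the whole star subgraph.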
Processing an RDF Query (based on Relational Algebra)
- The triples are stored in a single relation R(Subject, Property, Object):

    Subject    | Property | Object
    :Product1  | :price   | $499
    :Product1  | :name    | iphone 4
    :Product1  | :date    | 2011-10-14
    :Product2  | :name    | iphone 5
    :Product2  | :date    | 2012-09-12

- Conceptual execution plan: each triple pattern of the example query becomes a selection over a full scan of R (first, second, and third scans of R for :name, :price, and :date), and the selected tuples are joined on Subject = Subject, i.e., implicit joins on ?product.
- Intermediate results: (:Product1, :name, iphone 4) and (:Product2, :name, iphone 5) after the first selection; (:Product1, :name, iphone 4, :Product1, :price, $499) after the first join; (:Product1, :name, iphone 4, :Product1, :price, $499, :Product1, :date, 2011-10-14) after the second.

Now let us see how this example query can be processed using relational algebra. We store RDF triples in a single relation with schema (Subject, Property, Object). In the conceptual execution plan, the first triple pattern is translated into a selection of the rows whose property is :name, using a full scan. The second triple pattern is translated into a selection of the triples whose property is :price; the tuples from these two selections are joined on the Subject field because the triple patterns share a common subject variable. The third triple pattern is processed like the second. In summary, processing an RDF graph pattern amounts to multiple whole scans of the dataset plus multiple expensive self-join operations.
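The same conceptual plan can be written down directly. The following minimal sketch (an assumption-laden stand-in, not the paper's implementation) expresses the selections over full scans of R and the hash self-joins on the Subject column:

    # Sketch of the conceptual plan: one full scan per selection,
    # then hash self-joins on the Subject field.
    R = [
        (":Product1", ":price", "$499"),
        (":Product1", ":name",  "iphone 4"),
        (":Product1", ":date",  "2011-10-14"),
        (":Product2", ":name",  "iphone 5"),
        (":Product2", ":date",  "2012-09-12"),
    ]

    def select(relation, prop):
        # sigma_{Property = prop}(R): a full scan of the relation.
        return [t for t in relation if t[1] == prop]

    def join_on_subject(left, right):
        # Equi-join on Subject; the subject is the first field of a tuple.
        index = {}
        for t in right:
            index.setdefault(t[0], []).append(t)
        return [l + r for l in left for r in index.get(l[0], [])]

    result = join_on_subject(join_on_subject(select(R, ":name"),
                                             select(R, ":price")),
                             select(R, ":date"))
    print(result)  # one 9-field tuple for :Product1; :Product2 drops out

Each select() walks the whole relation once, which is exactly the repeated-scan cost the slides point out.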
Overview of MapReduce
- MapReduce (MR): a large-scale data processing system running on a cluster of machines [DEAN08].
- Tasks are encoded in low-level code as map/reduce functions, which are executed in parallel across the cluster:
  1. Map(k1, v1) -> list(k2, v2)
  2. Reduce(k2, list(v2)) -> list(v3)

The Map function is applied in parallel to every pair in the input dataset, and each map instance (m1, m2, m3) produces a list of pairs. The cost of the map phase therefore has three parts: reading the input data from the distributed file system (HDFS), executing the map function itself, and sorting and writing the intermediate data to local disk. The MapReduce framework then collects all pairs with the same key from all lists and groups them together; the Reduce function is applied in parallel to each group and produces a collection of values. The cost of the reduce phase likewise has three parts: reading the intermediate data stored on each map node (the data is transferred to the reduce nodes, sorted, and merged as reduce input), executing the reduce functions, and writing the results to HDFS. For example, blocks B1,1, B2,1, and B3,1 share the same key, say 1, and are therefore grouped together and transferred to reducer 1. As you can see, executing even a single MapReduce job requires expensive disk and network I/O operations.

Join Processing on MapReduce
- Example: an equi-join on the first column of relations L and R [BLANAS10]; a single join operation usually requires one MapReduce job.
- Map: extract the join column, tag each tuple with its relation (L or R), and annotate it with the join key; e.g., (k2, v4) in L becomes (k2, (L: k2, v4)) and (k2, v1) in R becomes (k2, (R: k2, v1)).
- Shuffle: tuples are pulled by reducers based on the join key, so the tagged tuples for k2 meet at the same reducer as (k2, ((L: k2, v4), (R: k2, v1))).
- Reduce: separate and buffer the input records into two sets according to the table tag (L or R), then perform a cross-product, producing the result (k2, v4, k2, v1).

Assume the left and right relations are stored as files in HDFS. The map function extracts the join column, tags each tuple with whether it comes from the left or right relation, and annotates it with the join key; the tuples are then pulled by reducers based on the join key; the reduce function separates and buffers the input records into two sets according to the tag and performs the cross-product. In summary, executing a single join usually requires a full MR job, which is expensive.
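The reduce-side (repartition) join just described can be simulated in a few lines. This is a minimal in-memory sketch of one MR job, assuming the toy L and R relations from the slide; map_phase, shuffle, and reduce_phase are hypothetical stand-ins for what the framework does:

    # Simulation of the reduce-side repartition join: tag in the map,
    # group by key in the shuffle, cross-product in the reduce.
    from collections import defaultdict

    L = [("k1", "v5"), ("k2", "v4")]
    R = [("k2", "v1"), ("k3", "v6")]

    def map_phase(relation, tag):
        # Extract the join column and tag each tuple with its relation.
        return [(tup[0], (tag, tup)) for tup in relation]

    def shuffle(mapped):
        # The framework groups all pairs with the same key.
        groups = defaultdict(list)
        for key, value in mapped:
            groups[key].append(value)
        return groups

    def reduce_phase(values):
        # Separate records into two sets by tag, then cross-product.
        left  = [tup for tag, tup in values if tag == "L"]
        right = [tup for tag, tup in values if tag == "R"]
        return [l + r for l in left for r in right]

    mapped = map_phase(L, "L") + map_phase(R, "R")
    for key, values in shuffle(mapped).items():
        for joined in reduce_phase(values):
            print(joined)  # only k2 joins: ('k2', 'v4', 'k2', 'v1')

In the real framework the shuffle step is the expensive part: it is the sort, network transfer, and merge whose cost was broken down above.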
Processing a Multi-Join Query on MapReduce
- (Extended) example query:

    SELECT * WHERE {
      ?product  :name   ?productName .
      ?product  :price  ?productPrice .
      ?producer :design ?product .
      ?producer :type   ?producerType .
    }

- As an optimization, instead of scanning the whole data file for each triple pattern, the RDF data is vertically partitioned (VP) by property in advance [ABADI07]. For example, if the dataset contains only the four properties :name, :price, :design, and :type, we select the triples for each property (e.g., all triples whose property is :name) and store each result as its own relation; later queries then scan only the relevant property relations.
- Corresponding logical plan based on VP: join :name and :price on subject (temp1); join :design and :type on subject (temp2); join temp1 and temp2 on subject = object.
- The logical plan is translated, i.e., split, into multiple MapReduce jobs: a join operation in MapReduce usually requires one MR job, and this plan contains three joins, so three MR jobs are generated, each with substantial cost.

Query Optimization on MapReduce
- Heuristic: group operations so that a workflow needs fewer MR jobs.
- Pig groups multiple join operations on the same key into the same MR cycle.
- Finding the optimal grouping is NP-hard; more advanced techniques use a greedy approach that groups non-conflicting joins as much as possible [HUSAIN11].
- (Extended) example query:

    SELECT * WHERE {
      ?product  :name   ?productName .
      ?product  :price  ?productPrice .
      ?product  :date   ?productDate .
      ?producer :design ?product .
      ?producer :type   ?producerType .
    }

- Corresponding logical plan based on VP: join :name, :price, and :date on subject (temp1); join :design and :type on subject (temp2); then join temp1 and temp2 on subject = object.

Query optimization on MapReduce therefore focuses on minimizing the length of the MR workflow by packing as many operations as possible into the same job. Such heuristics are common in MapReduce-based processing systems like Apache Pig. Pig accepts user queries written in its own language, Pig Latin, compiles them, and groups multiple joins on the same key into the same cycle. This is possible because the map function of a join merely tags tuples with the dataset they originate from; all tuples with the same join-key value are received by the same reducer, and only one reduce function is called per key value. Inside the reducer, once the tuples for a particular key are divided by parent dataset, the join is just a Cartesian product of those tuples. Leveraging this, an RDF query containing multiple triple patterns that share a common subject variable can be processed in a single MR cycle, regardless of the number of triple patterns. HadoopRDF, another MapReduce-based RDF processing system, extends this idea: using a greedy approach, it groups non-conflicting joins as much as possible. A conflict means that a single map operation cannot tag a tuple with multiple join attributes: if we tried to group joins on both the first and the second field of a key pair, the map phase could not annotate a tuple with both fields at once, because the map function assumes a single key per tuple; based on that key, the MapReduce system decides which reducer the tuple goes to. Using this approach, the first MR job groups the two independent subject-subject joins, one per star pattern, with no conflict since both join on the subject field (the tuple (k1, v5) is tagged with either its subject value or its object value, not both); the second job then joins, i.e., connects, the two stars. A sketch of this one-cycle star join follows below.
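The following minimal sketch shows the Pig-style optimization described above: all subject-subject joins of one star pattern grouped into a single simulated MR cycle. The vertically partitioned relations are in-memory assumptions, and one triple per property per subject is assumed for brevity:

    # Sketch: all star-joins on the same key (the subject) in one simulated
    # MR cycle -- the "map" keys each triple by subject and tags it with its
    # property relation; the "reduce" assembles one star per subject.
    from collections import defaultdict

    name  = [(":Product1", ":name",  "iphone 4"),
             (":Product2", ":name",  "iphone 5")]
    price = [(":Product1", ":price", "$499")]
    date  = [(":Product1", ":date",  "2011-10-14"),
             (":Product2", ":date",  "2012-09-12")]

    def star_join_one_cycle(relations):
        groups = defaultdict(dict)
        for tag, relation in relations.items():      # "map": key by subject
            for s, p, o in relation:                 # simplification: one
                groups[s][tag] = (s, p, o)           # triple per property
        stars = []
        for s, parts in groups.items():              # "reduce": one call per key
            if set(parts) == set(relations):         # subject has every property
                stars.append(sum((parts[t] for t in relations), ()))
        return stars

    print(star_join_one_cycle({"name": name, "price": price, "date": date}))

Because every join key is the subject, one shuffle suffices no matter how many triple patterns the star contains; :Product2 drops out for lack of a :price triple, as before.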
Queries with Repeated Properties (QRP)
- Example query: we want the list of products with detailed information, plus information about each producer (the company name, the type of company, and its foundation date):

    SELECT * WHERE {
      ?product  :name   ?prodName .
      ?product  :type   ?prodType .
      ?product  :date   ?prodDate .
      ?product  :price  ?prodPrice .
      ?producer :design ?product .
      ?producer :name   ?prcName .
      ?producer :type   ?prcType .
      ?producer :date   ?prcDate .
    }

- MR plan with Pig-based vertical partitioning (TS = TableScan/Load operator): J1 SPLITs the graph data by property (:price, :name, :type, :date, :design) and stores the relations in HDFS; J2 reads four property relations and joins them on the common subject to build the substars matching the first star pattern; J3 reads four property relations, including :name, :type, and :date again, to build the substars matching the second star pattern; J4 joins the two stars and produces the result set.
- Issue: :name, :type, and :date are scanned repeatedly across MR jobs J2 and J3, costing additional communication and disk I/O.
- Possible optimization considerations:
  - Minimize the scan overhead using indexes; however, MapReduce does not support any indexes by default.
  - Buffer such relations across multiple joins; however, this is memory-intensive, and such relations may exceed the memory of each node.
  - Another approach, algebraic optimization: rewrite queries into equivalent but less expensive ones. We optimize such queries with our previous work, the Nested TripleGroup Algebra.

Repeated properties means duplicate properties across the sub star patterns: here :name, :type, and :date appear in the first star pattern, and the same set of properties appears in the second. This kind of query is quite common in RDF, since data providers are encouraged to reuse terms from widely deployed vocabularies wherever possible; for example, dc:date is widely used for date-related information, and rdfs:name is commonly used for names.
General Intuition in NTGA
- Nested TripleGroup Algebra (NTGA): re-interpret multiple star-joins as a grouping operation, which yields groups of triples (TripleGroups) instead of n-tuples [RAVINDRA11].
- Example query:

    SELECT * WHERE {
      ?x :p1 ?o1 .
      ?x :p2 ?o2 .
      ?y :p3 ?o2 .
      ?y :p4 ?o3 .
    }

- Input triples: (:s1, :p1, :o1), (:s1, :p2, :o2), (:s2, :p3, :o3), (:s2, :p4, :o4).
- VP: one MR job for each star pattern (one each for the ?x star and the ?y star), i.e., 2 MR jobs, producing the n-tuples t1 = (:s1, :p1, :o1, :s1, :p2, :o2) and t2 = (:s2, :p3, :o3, :s2, :p4, :o4).
- NTGA: 1 MR job for all star patterns, producing the TripleGroups tg1 = {(:s1, :p1, :o1), (:s1, :p2, :o2)} and tg2 = {(:s2, :p3, :o3), (:s2, :p4, :o4)} -- a different structure, but content-equivalent to the n-tuples.

Assuming the vertical partitioning approach is used, the first two triples are the matching result of the first star pattern and the last two match the second. Our intuition is that grouping these triples by subject produces a result set equivalent to that of the two join operations: the data model is different, but a group of triples and the corresponding n-tuple contain equivalent triples. Our approach always uses a single MR job for all star-joins, and we define a number of our own operators because the data model is different.

Processing an RDF Query with NTGA
- For the QRP example query above, the NTGA plan uses J1 for TG_GroupBy and TG_GroupFilter (group the triples by subject, filter out the TripleGroups violating the structural constraints, and produce the two kinds of TripleGroups), and J2 for TG_Join between the star patterns followed by TG_Unnest and TG_Flatten.
- NTGA: 2 MR jobs (2 HDFS reads) versus 4 MR jobs (4 HDFS reads) for the VP plan, for an equivalent result set.

A "Key" NTGA Operator: TG_GroupFilter
- Retains only the TripleGroups that satisfy the required query substructure, by checking for an exact match between the set of properties of a star pattern and the properties of a TripleGroup.
- Example query:

    SELECT * WHERE {
      ?x :p1 :o1 .
      ?x :p2 ?y .
      ?y :p3 :o2 .
      ?y :p4 :o3 .
    }

- Input TripleGroups: tg1 = {(:s1, :p1, :o1), (:s1, :p2, :o2)} and tg2 = {(:s2, :p2, :o2), (:s2, :p3, :o3)}.
- tg1's properties :p1 and :p2 exactly match the first star pattern, so tg1 passes; tg2 matches no star pattern in the query, so tg2 is filtered out.

NTGA provides a number of operators, and TG_GroupFilter is the one that enforces the structural constraints specified in a star subpattern. The grouping operation itself literally just groups triples by subject and enforces no conditions on how triples should be grouped to match the sub star patterns; TG_GroupFilter provides that function. It accepts the TripleGroups generated by the group-by operation and checks for an exact match between each TripleGroup and the property set of some star pattern.
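A minimal sketch of these two operators, assuming in-memory stands-ins for TG_GroupBy and TG_GroupFilter (the data mirrors the tg1/tg2 example above):

    # Sketch: group triples by subject (TG_GroupBy stand-in), then keep only
    # TripleGroups whose property set exactly matches some star pattern
    # (TG_GroupFilter stand-in).
    from collections import defaultdict

    triples = [
        (":s1", ":p1", ":o1"), (":s1", ":p2", ":o2"),
        (":s2", ":p2", ":o2"), (":s2", ":p3", ":o3"),
    ]
    star_patterns = [{":p1", ":p2"}, {":p3", ":p4"}]  # property sets per star

    def tg_groupby(triples):
        groups = defaultdict(list)
        for s, p, o in triples:
            groups[s].append((s, p, o))
        return list(groups.values())

    def tg_groupfilter(triplegroups, star_patterns):
        kept = []
        for tg in triplegroups:
            props = {p for (_, p, _) in tg}
            if any(props == pattern for pattern in star_patterns):  # exact match
                kept.append(tg)
        return kept

    print(tg_groupfilter(tg_groupby(triples), star_patterns))  # only tg1 survives

Note the exact-match test: this is the one-to-one assumption between TripleGroups and star patterns that the next slides show breaking down under repeated properties.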
Outline (recap): So far I have covered the background for RDF graph pattern matching on MapReduce and briefly described our previous work, NTGA. Now let me explain how we can process queries with repeated properties in NTGA.

TG_GroupFilter Semantics and Repeated Properties
- Given graph pattern (the QRP example query from before, with star pattern stp1 on ?product and stp2 on ?producer):

    SELECT * WHERE {
      ?product  :name   ?prodName .
      ?product  :type   ?prodType .
      ?product  :date   ?prodDate .
      ?product  :price  ?prodPrice .
      ?producer :design ?product .
      ?producer :name   ?prcName .
      ?producer :type   ?prcType .
      ?producer :date   ?prcDate .
    }

- A TripleGroup from TG_GroupBy: tg0 = {(s1 :type o1), (s1 :name o2), (s1 :date o3), (s1 :price o4), (s1 :design o5)}.
- TG_GroupFilter assumes a 1-1 correspondence between TripleGroups and star subpatterns, but with repeated properties there can be ambiguities: tg0 is a partial match for both stp1 and stp2.

The NTGA-based plan is efficient, but there is an issue with the semantics of TG_GroupFilter: until now we assumed a one-to-one mapping between generated TripleGroups and sub star patterns. This assumption fails if the query pattern contains repeated properties. A TripleGroup containing properties from both star patterns is ambiguous: should it be matched with the first sub star pattern or the second, given that it partially matches both? We therefore need to relax the assumption that each TripleGroup corresponds to one star pattern, and extend the machinery for managing and identifying TripleGroups.

Overview of the Solution
- Issue: the mappings between TripleGroups and star patterns become ambiguous if repeated properties exist across multiple star patterns.
- Goal: produce TripleGroups that are each an exact match for one star pattern in the query.
- Solution: classify the filtering process into two steps (see the classification sketch after this list):
  1. Remove incomplete TripleGroups that do not match any star pattern (i.e., eliminate non-well-formed TripleGroups).
  2. Resolve the ambiguity of the remaining TripleGroups that may match multiple star patterns (ambiguous TripleGroups), and generate TripleGroups that are each an exact match for a single star pattern (perfect TripleGroups).
- Well-formed TripleGroup: a TripleGroup consisting of triples that contain all the properties of some star subpattern.
- Example query (stp1 on ?product; stp2 on ?producer):

    SELECT * WHERE {
      ?product  :name   ?prodName .
      ?product  :date   ?prodDate .
      ?product  :price  ?prodPrice .
      ?producer :design ?product .
      ?producer :name   ?prcName .
      ?producer :date   ?prcDate .
    }

- TripleGroups generated from TG_GroupBy:
  tg1 = {(s1 :name :o1), (s1 :date :o2), (s1 :price :o3)}
  tg2 = {(s1 :name :o1), (s1 :date :o2), (s1 :price :o3), (s1 :design :o4)}
  tg3 = {(s1 :name :o4), (s1 :design :o3)}

As a foundation for the approach, we first categorize TripleGroups. We can exclude TripleGroups that cannot match any sub star pattern and keep those that contain all the properties of some sub star pattern, which we call well-formed TripleGroups. For example, tg1 is well-formed because it contains all the properties of stp1, and tg2 is also well-formed because it contains all the properties of both star patterns stp1 and stp2, so both are candidate TripleGroups for answers. tg3 is not well-formed because it does not contain all the required properties of either star pattern. Only a well-formed TripleGroup has the potential to be an answer for some star pattern.
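The two-step classification can be sketched as a small (hypothetical, simplified) helper that labels a TripleGroup with respect to the star patterns' property sets:

    # Classification sketch following the slide definitions: non-well-formed,
    # ambiguous (well-formed, covers multiple patterns), or perfect
    # (well-formed, exact match for a single pattern).
    def classify(triplegroup, star_patterns):
        props = {p for (_, p, _) in triplegroup}
        covered = [sp for sp in star_patterns if sp <= props]
        if not covered:
            return "non-well-formed"   # cannot match any star pattern
        if len(covered) > 1:
            return "ambiguous"         # may match multiple star patterns
        return "perfect" if covered[0] == props else "well-formed"

    stp1 = {":name", ":date", ":price"}
    stp2 = {":design", ":name", ":date"}

    tg1 = [("s1", ":name", ":o1"), ("s1", ":date", ":o2"), ("s1", ":price", ":o3")]
    tg2 = tg1 + [("s1", ":design", ":o4")]
    tg3 = [("s1", ":name", ":o4"), ("s1", ":design", ":o3")]

    for tg in (tg1, tg2, tg3):
        print(classify(tg, [stp1, stp2]))  # perfect, ambiguous, non-well-formed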
Ambiguous and Perfect TripleGroups
- Ambiguous TripleGroup: a well-formed TripleGroup that can be matched with multiple star subpatterns in a query, e.g., tg2 above.
- Perfect TripleGroup: a well-formed TripleGroup that is an exact match for a single star pattern; these are the valid intermediate answers for the query, e.g., tg1.

Dealing with Ambiguous TripleGroups
- Extended example query with a third star pattern stp3 on ?seller:

    SELECT * WHERE {
      ?product  :name   ?prodName .
      ?product  :date   ?prodDate .
      ?product  :price  ?prodPrice .
      ?producer :design ?product .
      ?producer :name   ?prcName .
      ?producer :date   ?prcDate .
      ?seller   :sell   ?product .
      ?seller   :name   ?selName .
    }

- Approach: clone perfect TripleGroups from each ambiguous TripleGroup (see the cloning sketch after this list). The ambiguous TripleGroup tg0 = {(s1 :name :o1), (s1 :date :o2), (s1 :price :o3), (s1 :design :o4)} contains all the properties of both stp1 and stp2, so it is cloned into two perfect TripleGroups: Clone(:name, :date, :price) selects the triples whose properties are :name, :date, and :price and yields tg1 for stp1, while Clone(:design, :name, :date) yields tg2 for stp2. No perfect TripleGroup can be cloned for stp3 (Clone(:sell, :name)) because tg0 contains no triple with property :sell.
- The clone operation also yields a scan-sharing effect: we pay only the cost of generating the single ambiguous TripleGroup, whereas, to produce an equivalent result set, the VP approach would have to read the six triples of the two perfect TripleGroups (properties :name, :date, :price and :design, :name, :date) separately from HDFS.
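A minimal sketch of the cloning step, assuming the same in-memory TripleGroup representation as before (clone_perfect is a hypothetical helper, not the paper's code):

    # Cloning sketch: from an ambiguous TripleGroup, emit one perfect
    # TripleGroup per star pattern whose properties are fully contained in
    # it; patterns with a missing property (stp3's :sell) get no clone.
    def clone_perfect(triplegroup, star_patterns):
        props = {p for (_, p, _) in triplegroup}
        clones = []
        for pattern in star_patterns:
            if pattern <= props:                         # all properties present
                clones.append([t for t in triplegroup if t[1] in pattern])
        return clones

    stp1 = {":name", ":date", ":price"}
    stp2 = {":design", ":name", ":date"}
    stp3 = {":sell", ":name"}

    tg0 = [("s1", ":name", ":o1"), ("s1", ":date", ":o2"),
           ("s1", ":price", ":o3"), ("s1", ":design", ":o4")]

    for clone in clone_perfect(tg0, [stp1, stp2, stp3]):
        print(clone)  # two clones, for stp1 and stp2; none for stp3

The scan-sharing effect is visible here: the shared :name and :date triples are read once into tg0 and then reused by both clones, instead of being re-scanned per star pattern.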
?product :price ?prodPrice .?producer :design ?product .?producer :name ?prcname .?producer :date ?prcdate .?seller :sell ?product?seller :name ?selName} s1 :name :o1s1 :date :o2s1 :price :o3s1 :design :o4tg0=s1 :name :o1s1 :date :o2s1 :price :o3tg1=s1 :design :o4s1 :name :o1s1 :date :o2tg2=So this slides show NTGA-based MapReduce plan containing revised TG_GroupFilter. Before filtering out TripleGroups based on exact match, we first solve the ambiguity by cloning operation; for example, tg0 is cloned into two perfect TripleGroups tg1, and tg2. and the join operation will be made and produce the correct results. 21Losslessness of Revised TG_Groupfilter.SubjectPropertyObject:s1:price:o1:s1:name:o2:s1:date:o3:s1:design:o4s1 :name :o1s1 :date :o2s1 :price :o3s1 :design :o4tg0=,s1 :name :o1s1 :date :o2s1 :price :o3tg1=s1 :design :o4s1 :name :o1s1 :date :o2tg2=t1 = (:s1, :name, :o1, :s1, :date, :o2, :s1, :price, :o3)1) name (subject=subject) date (subject=subject) pricet2 = (:s1, :design, :o4, :s1, :name, :o1, :s1, :price, :o3)1. Relational Algebra (VP)2. NTGAExample Dataset(clone)Filter out non-well-formed TripleGroup.Incomplete TripleGroup that does not contain all the properties for any star patterns clearly does not match any star patterns in a query.Generate multiple Perfect TripleGroups from an ambiguous TripleGroups.2) design (subject=subject) name (subject=subject) dateGIve brief description provides lossless of cloning: Why they are equivalent: we don't generate superflous or loss triplegroup from the proof. 22OutlineBackgroundRDF Graph Pattern Matching Graph Pattern Matching on MapReduceQueries with Repeated Properties (QRP)Nested Triplegroup Algebra (NTGA)Challenges: Processing QRP with NTGAApproach: TripleGroup CloningWell-formed, Ambiguous, and Perfect TripleGroupsTripleGroup Cloning in TG_GroupFilterEvaluationRelated Work23Setup and TestBed Setup: Implement VP and NTGA on top of Apache Pig.10-node Hadoop clusters on NCSUs VCL*.Three approaches were considered :1-join-per-cycle (SHARD)1-star-join-per-cycle (Pig-Def or VP)all-star-joins-1-cycle (NTGA) Evaluation of the redundant scans during star-join computations.Task 1a varying the ratio of repeated properties to fixed ones. Task 1b varying the selectivity of repeated properties.Task 2 scaling up sub patterns with repeated properties.Task 3 scalability test with varying data size *https://vcl.ncsu.edu [ROHLOFF10] 24DatasetDataset: Synthetic benchmark dataset generated using BSBM*From 22GB (250k Products, BSBM-250k ~86M triples) Up to 87GB (1M Products, BSBM-1000k ~350M triples)7 repeated properties:- Across all classes e.g. type, publisher- Only for a smaller subset of classes, e.g. nameThe size and selectivity ** of BSBM-250k : :publisher - 1.7GB, 0.091:type - 1.8GB, 0.105:name - 49MB, 0.003:date - 1.4GB, 0.09125Task 1a: Varying the Ratio of Repeated Properties to Fixed ones. Test Queries (dq0 to dq4)Two star patterns with fixed subset of unique properties + varying #repeated properties in the second star pattern (from 0 to 4). Overall #triple patterns increase from 8 to 12:publisher:name:type:datedq0: 2 star pattern, 0 repeated properties.dq4: 2 star patterns, 4 repeated properties.(:type, :publisher, :name, :date)Black edge: arbitrary unique propertyRed edge: repeated property:publisher:name:type:date:publisher:name:type:datedq1: 1 repeated props.dq2: 2 repeated props.dq3: 3 repeated props.26Task 1a: Varying the Ratio of Repeated Properties to Fixed ones. 
Task 1a results:
- Pig-Def uses 4 MR cycles, NTGA 2 cycles, SHARD 13 cycles.
- With an increasing number of repeated properties: NTGA shows constant HDFS reads and execution time, and fewer HDFS writes due to the smaller number of required MR jobs; SHARD's scans of the whole relation increase; Pig-Def/VP's scans of the property relations increase.

From dq0 to dq4, NTGA's execution time and HDFS reads remain almost constant even as the number of repeated properties grows from 0 to 4. The amount of HDFS writes is not directly related to the scan-sharing effect, but NTGA writes less because it requires fewer MR jobs than the other approaches. The job timelines show NTGA needing only 2 MR jobs versus 4 for Pig-Def and 13 for SHARD, and NTGA has the fastest execution time of the three.

Task 1b: Varying the Size of Repeated Properties
- Test queries rq1 and rq2: identical queries with two star subpatterns, differing only in the single repeated property: rq1 repeats :publisher (1.7GB, 9.1% of the dataset), rq2 repeats :name (49MB, 0.3%).
- NTGA has around a 42% performance gain over Pig-Def for rq2, increasing to around 48% for rq1; with rq2, Pig-Def consistently takes about 70 seconds longer than with rq1.

This task checks whether the selectivity of the repeated properties in a dataset affects the performance. To test this directly, the BSBM data generator would need a feature to control the number of triples whose properties repeat across patterns; however, neither BSBM nor other RDF data generators provide such a feature. As a workaround, we designed two identical queries containing a single repeated property across stars and swapped that property: rq1 uses :publisher (around 9.1% of the dataset) while rq2 uses :name (around 0.3%). We do not have extensive data for this task, but we confirmed that NTGA's performance gain over Pig grows as the volume of repeated properties in the dataset grows.

Task 2: Scaling Up Subpatterns with Repeated Properties
- Four queries mq1 to mq4: two repeated properties (:publisher, :type) occur in each star subpattern, and the number of star patterns varies from 1 to 4, so the total number of repeated properties across the graph pattern query increases from 2 (mq1) to 8 (mq4).

This task is designed to see whether NTGA gains an advantage as we increase the number of star patterns containing repeated properties. We varied the number of star patterns from 1 to 4; each star pattern consists of four triple patterns, two of which are repeated across star patterns.
In other words, we test the performance variation as we indirectly increase the number of repeated properties by increasing the number of star patterns.

Task 2 results:
- From mq1 to mq4, as the number of star patterns and repeated properties grows (from 2 to 8), the amount of scan-sharing across star patterns grows from around 40GB to 120GB.
- The amount of HDFS reads shows no significant change, thanks to the scan-sharing effect on repeated properties across star patterns; execution time increases due to the join operations connecting the substars. For example, mq4 requires three additional MR jobs compared with mq1 to connect its four stars.

Task 3: Varying the Size of Graphs
- Increases the number of RDF triples for query dq4 from Task 1a, from BSBM-250k (22GB) to BSBM-1000k (86GB).
- The NTGA approach scales well: the performance gain grows from 52% to 58%. The size of the relations containing repeated properties does not increase linearly with the data size.

Related Work
- RDF data processing on MapReduce:
  - SHARD [ROHLOFF10]: the clause-iteration algorithm (n+1 jobs to process n triple patterns).
  - HadoopDB-style SPARQL processing [HUANG11]: a hybrid architecture of a database (RDF-3X) and Hadoop with a graph-partitioning scheme.
  - HadoopRDF [HUSAIN11]: a customized storage format and plan generation based on a heuristic greedy approach.
- Work sharing on MapReduce:
  - MRShare [NYKIEL10]: an inter-query sharing scheme customized for the MapReduce framework.
  - Nova [OLSTON11]: shares the initial load operation when multiple copies of a workflow use identical input.
  - CoScan [WANG11]: minimizes redundant data loading by merging multiple Pig scripts.

HadoopDB comprises Postgres on each node (the database layer), Hadoop as a communication layer coordinating the nodes, and Hive as the translation layer; later, Postgres was replaced with RDF-3X, a single-node RDF processing engine, to process RDF queries with a graph-partitioning scheme. MRShare transforms a batch of queries into a new batch that executes more efficiently, by merging jobs into groups and evaluating each group as a single query. Nova is a workflow manager that supports continuous, large-scale data processing on top of Pig/Hadoop. CoScan is a scheduling framework that eliminates redundant processing in workflows scanning large batches of data; it merges Pig programs from multiple users at runtime to reduce I/O contention while meeting soft deadline requirements in scheduling.

Relevant Publications
- Kim, H., Ravindra, P., Anyanwu, K.: Scan-Sharing for Optimizing RDF Graph Pattern Matching on MapReduce. In: Proc. CLOUD (2012)
- Anyanwu, K., Kim, H., Ravindra, P.: Algebraic Optimization for Processing Graph Pattern Queries in the Cloud. IEEE Internet Computing (2012)
- Kim, H., Ravindra, P., Anyanwu, K.: From SPARQL to MapReduce: The Journey Using a Nested TripleGroup Algebra. In: Proc. VLDB (2011) (demonstration)
- Ravindra, P., Kim, H., Anyanwu, K.: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Platforms. In: Proc. ESWC (2011)

References
[DEAN08] Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters. Commun.
ACM 51 (2008) 107-113
[OLSTON08] Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig Latin: A Not-So-Foreign Language for Data Processing. In: Proc. SIGMOD (2008)
[HUSAIN11] Husain, M. F., McGlothlin, J., et al.: Heuristics-Based Query Processing for Large RDF Graphs Using Cloud Computing. TKDE 23 (2011) 1312-1327
[HUANG11] Huang, J., Abadi, D. J., et al.: Scalable SPARQL Querying of Large RDF Graphs. Proc. VLDB 4(11) (2011)
[NYKIEL10] Nykiel, T., Potamias, M., et al.: MRShare: Sharing across Multiple Queries in MapReduce. Proc. VLDB 3 (2010) 494-505
[OLSTON11] Olston, C., Chiou, G., et al.: Nova: Continuous Pig/Hadoop Workflows. In: Proc. SIGMOD (2011) 1081-1090
[WANG11] Wang, X., Olston, C., et al.: CoScan: Cooperative Scan Sharing in the Cloud. In: Proc. SoCC (2011) 11:1-11:12
[RAVINDRA11] Ravindra, P., Kim, H., et al.: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce. In: Proc. ESWC, LNCS 6644 (2011) 46-61
[ABADI07] Abadi, D. J., Marcus, A., et al.: Scalable Semantic Web Data Management Using Vertical Partitioning. In: Proc. VLDB (2007)
[ROHLOFF10] Rohloff, K., Schantz, R. E.: High-Performance, Massively Scalable Distributed Systems Using the MapReduce Software Framework: The SHARD Triple-Store. In: PSI EtA (2010) 4:1-4:5
[NEUMANN10] Neumann, T., Weikum, G.: The RDF-3X Engine for Scalable Management of RDF Data. The VLDB Journal 19 (2010) 91-113
[WEISS08] Weiss, C., Karras, P., Bernstein, A.: Hexastore: Sextuple Indexing for Semantic Web Data Management. Proc. VLDB 1(1) (2008)
[HERODOTOU11] Herodotou, H., Babu, S.: Profiling, What-if Analysis, and Cost-Based Optimization of MapReduce Programs. Proc. VLDB 4 (2011)
[BLANAS10] Blanas, S., Patel, J. M., Ercegovac, V., Rao, J., Shekita, E. J., Tian, Y.: A Comparison of Join Algorithms for Log Processing in MapReduce. In: Proc. SIGMOD (2010)

Thank You!

Backup: RDF Data Model (Resource Description Framework)
- Statements (triples) and their graph representation (ovals: resources, i.e., URIs; rectangles: literals):

    Subject    | Property   | Object
    :Product1  | :name      | iphone4
    :Product1  | :color     | :white
    :Product1  | :date      | 2011-10-14
    :Product1  | :publisher | :Producer1
    :Producer1 | :name      | Apple
    :Producer1 | :type      | :Producer
    :Producer1 | :date      | 1976-04-01
    :Producer1 | :homepage  | apple.com

- Star subgraphs: sets of edges with the same subject, e.g., around :Product1 and :Producer1.

An RDF dataset consists of multiple triples, which can be represented as a labeled, directed graph. This graph representation contains two star subgraphs, each consisting of a set of edges with the same subject node, :Product1 and :Producer1, and they are connected at the :Producer1 node. Ovals in this graph denote URIs such as :Producer1, and rectangles denote literals such as "Apple", i.e., a string or another common datatype.

Backup: Relationship between TripleGroups and n-tuples
- Different structure, but content-equivalent:
  1. TripleGroup in NTGA (from TG_GroupBy and TG_GroupFilter): tg1 = {(:Product1, :type, :Product), (:Product1, :date, 1976-04-01), (:Product1, :name, iphone 4)}
  2. n-tuple in VP (from SPLIT and JOIN): (:Product1, :type, :Product, :Product1, :date, 1976-04-01, :Product1, :name, iphone 4)
- TripleGroups are not structurally equivalent to n-tuples, but are content-equivalent.
The TripleGroups generated by the grouping operation have a different structure from the n-tuples generated by the vertical partitioning approach, but as this figure shows, their contents are essentially equivalent.

Backup: NTGA Quick Reference
Consider a set of TripleGroups TG = {tg1, tg2} such that:
  tg1 = {(:Prdct1, :name, iphone4), (:Prdct1, :publisher, :Prdcr1), (:Prdct1, :price, 100)}
  tg2 = {(:Prdcr1, :type, :Prdcr), (:Prdcr1, :date, 1976-04-01), (:Prdcr1, :hpage, apple.com)}

1. TG_Flatten(tg1) = (:Prdct1, :name, iphone4, :Prdct1, :publisher, :Prdcr1, :Prdct1, :price, 100)
2. TG_Join(?o :publisher ?v : TG{:name, :publisher, :price}, ?v :type ?t : TG{:type, :date, :hpage}) = ntg, the nested TripleGroup {(:Prdct1, :name, iphone4), (:Prdct1, :publisher, {(:Prdcr1, :type, :Prdcr), (:Prdcr1, :date, 1976-04-01), (:Prdcr1, :hpage, apple.com)}), (:Prdct1, :price, 100)}
3. TG_Unnest(ntg) = {(:Prdct1, :name, iphone4), (:Prdct1, :publisher, :Prdcr1), (:Prdcr1, :type, :Prdcr), (:Prdcr1, :date, 1976-04-01), (:Prdcr1, :hpage, apple.com), (:Prdct1, :price, 100)}

This quick-reference table shows the rest of the NTGA operators. TG_Flatten flattens TripleGroup tg1 into the corresponding n-tuple; TG_Join joins two TripleGroups, here tg1 and tg2 on the common value :Prdcr1, producing a nested TripleGroup; and TG_Unnest unnests that nested TripleGroup again.

Backup: Execution on MapReduce Platforms
- MapReduce (MR): a popular large-scale data processing system running on a cluster of commodity-grade machines [DEAN08]; tasks are encoded in low-level code as map/reduce functions, which are executed in parallel across the cluster.
- Apache Hadoop (http://hadoop.apache.org) is the open-source implementation.
- Extended systems provide high-level languages for specifying tasks, along with optimizing compilers that generate map/reduce code a la database systems: Pig Latin for Apache Pig (http://pig.apache.org) and HiveQL for Apache Hive (http://hive.apache.org).

Let me briefly recap the MapReduce framework: users encode their tasks as map and reduce functions, and these tasks are executed in parallel across the cluster; the open-source implementation is Apache Hadoop. These days, extended systems accept high-level languages for specifying tasks, and the corresponding MapReduce jobs are automatically generated and run across the cluster.
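For completeness, here is a hedged sketch of TG_Flatten and TG_Unnest over the tg1/ntg examples above; plain Python tuples and lists stand in for NTGA's data model, and the unnesting logic is my reading of the quick-reference table, not the paper's code:

    # Sketches of TG_Flatten and TG_Unnest over the quick-reference examples.
    def tg_flatten(triplegroup):
        # Concatenate the triples of a TripleGroup into one n-tuple.
        return tuple(field for triple in triplegroup for field in triple)

    def tg_unnest(nested_tg):
        # Replace a nested object (itself a TripleGroup) by the joined
        # subject, and splice the inner triples back in alongside it.
        flat = []
        for s, p, o in nested_tg:
            if isinstance(o, list):            # a nested TripleGroup
                flat.append((s, p, o[0][0]))   # restore the link triple
                flat.extend(o)
            else:
                flat.append((s, p, o))
        return flat

    tg2 = [(":Prdcr1", ":type", ":Prdcr"),
           (":Prdcr1", ":date", "1976-04-01"),
           (":Prdcr1", ":hpage", "apple.com")]

    ntg = [(":Prdct1", ":name", "iphone4"),
           (":Prdct1", ":publisher", tg2),     # tg2 nested under :publisher
           (":Prdct1", ":price", "100")]

    print(tg_flatten([(":Prdct1", ":name", "iphone4"),
                      (":Prdct1", ":publisher", ":Prdcr1"),
                      (":Prdct1", ":price", "100")]))
    print(tg_unnest(ntg))   # matches the TG_Unnest result in the table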
Backup: Architecture of RAPID+
- Query flow through the layers: a SPARQL parser and a Pig Latin parser in the parser layer; a query analyzer; a logical plan generator/optimizer; a logical-to-physical plan translator containing a Pig Latin plan generator and an NTGA plan generator; a MapReduce job compiler; and finally the Hadoop job tracker.

Backup: Comparison of Disk Access between VP and NTGA
- Example query:

    SELECT * WHERE {
      ?producer :homepage  ?hpage .
      ?producer :name      ?prcname .
      ?producer :type      :Producer .
      ?producer :date      ?prcdate .
      ?product  :name      ?prodname .
      ?product  :type      :Product .
      ?product  :date      ?prodDate .
      ?product  :publisher ?producer .
    }

- VP plan (J = MapReduce job; ovals: operators; rounded boxes: intermediate results): J1 SPLITs the data into the :homepage, :name, :type, :date, and :publisher relations in HDFS; J2 joins (:homepage, :name, ...) for the producer star; J3 joins (:name, :type, :date, ...) for the product star; J4 joins the two stars (:homepage, :name, :type, :date) with (:name, :type, :date, :publisher). Total: 4 MR jobs, 4 HDFS reads.
- NTGA plan: J1 runs TG_GroupBy and TG_GroupFilter, producing the {:homepage, :name, :type, :date} and {:name, :type, :date, :publisher} TripleGroups; J2 runs TG_JOIN, TG_Unnest, and TG_Flatten to produce the result file. Total: 2 MR jobs, 2 HDFS reads.
