SPARQL Basic Graph Pattern Processing with Iterative MapReduce

2010-04-26

Presented by Jaeseok Myung

Intelligent Database Systems Lab

School of Computer Science & Engineering

Seoul National University, Seoul, Korea

Copyright 2010 by CEBT

MapReduce

MapReduce is easily accessible

The Hadoop project provides an open-source MR implementation

MapReduce gives users a simple abstraction for utilizing parallel and distributed systems

Programming Model

– Map(k,v) -> list(k’, v’)

– Reduce(k’, list(v’)) -> list(v’’)

Useful for Massive Data Processing
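
The programming model above can be illustrated with a minimal in-memory sketch. It is not Hadoop code: `run_mapreduce`, `word_map` and `word_reduce` are hypothetical names introduced here, and the word-count task is only an example chosen for this sketch.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Run one MapReduce round in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        # Map(k, v) -> list(k', v')
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    results = []
    for out_key in sorted(groups):
        # Reduce(k', list(v')) -> list(v'')
        for out_value in reduce_fn(out_key, groups[out_key]):
            results.append((out_key, out_value))
    return results

def word_map(_line_no, line):
    # Emit (word, 1) for every word in the line.
    return [(word, 1) for word in line.split()]

def word_reduce(_word, counts):
    # Sum the partial counts collected for one word.
    return [sum(counts)]

counts = run_mapreduce([(0, "map reduce map"), (1, "reduce map")], word_map, word_reduce)
print(counts)
```

In a real deployment the grouping step is the distributed shuffle performed by the framework; here a dictionary stands in for it.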

Center for E-Business Technology MDAC 2010 – 2/23


MR & Cloud Computing

MapReduce is a kind of platform

MapReduce utilizes a number of commodity machines

There can be a number of applications using MapReduce

[Figure: several applications (App.) running on top of the MapReduce platform]


RDF Data Warehouse using MapReduce

Data Warehouse using MapReduce

Extensive studies have shown that MR is well suited to large-scale, fault-tolerant data analysis

Hive, CloudBase

– Data warehousing solutions built on top of Hadoop

Advantages

– Scalability

– Extensibility

– Fault-tolerance

My Research Interest

RDF Data Warehouse using MapReduce



Why RDF Data Warehouse?

Flexible Data Model

The underlying structure of any expression in RDF is a collection of triples (s, p, o)

Data Integration

RDB-to-RDF (intra)

Linked Open Data (inter)

Incremental Integration

Inference

We can derive new knowledge from what we already know

A goal of data analyses



Approaches & Advantages

[Figure: approaches to building a data warehouse. Panels: Conventional DW Solutions and RDF Data Warehouse; Centralized (before the cloud) and Distributed & Parallel ((MR) cloud computing). Labels in the figure: Flexibility, Integration, Inference; Complexity, Large-scale data analyses; Scalability, Extensibility, Fault-tolerance; Support Tools; Simple, Fast; Performance, Optimization]


SPARQL BGP Processing with MapReduce

Both RDF and MapReduce can benefit a data warehouse

RDF is a data model

– Flexibility, Integration, Inference

MapReduce is a programming model

– Scalability, Extensibility, Fault-tolerance

It has been difficult to create synergy because only a few algorithms connect the data model and the framework

We should focus on an MR algorithm that manipulates RDF datasets

A MapReduce Algorithm for SPARQL Basic Graph Pattern Processing



SPARQL Basic Graph Pattern

SPARQL is a query language for RDF datasets

A Basic Graph Pattern (BGP) is a set of triple patterns

Triple patterns are similar to RDF triples (s, p, o) except that each of the subject, predicate and object can be a variable

BGP processing is important

– Most SPARQL queries have one or more BGPs

– BGPs require expensive join operations among triple patterns


SELECT ?x ?y1 ?y2 ?y3
WHERE {
    ?x rdf:type ub:Professor .           # TP#1
    ?x ub:worksFor <Department0> .       # TP#2
    ?x ub:name ?y1 .                     # TP#3
    ?x ub:emailAddress ?y2 .             # TP#4
    ?x ub:telephone ?y3                  # TP#5
}

The five triple patterns TP#1 to TP#5 together form the BGP
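
To make the notion of a triple pattern concrete, the following is a minimal sketch of how one RDF triple can be tested against one pattern. The function name `match_pattern` and the tuple representation are assumptions made for this illustration; terms starting with `?` are treated as variables.

```python
def match_pattern(pattern, triple):
    """Return the variable bindings if the triple satisfies the pattern, else None."""
    bindings = {}
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            name = term[1:]
            # A variable used twice in one pattern must bind to the same value.
            if bindings.get(name, value) != value:
                return None
            bindings[name] = value
        elif term != value:
            # A constant term must match the triple exactly.
            return None
    return bindings

# The BGP of the example query as (subject, predicate, object) tuples.
bgp = [
    ("?x", "rdf:type", "ub:Professor"),
    ("?x", "ub:worksFor", "<Department0>"),
    ("?x", "ub:name", "?y1"),
    ("?x", "ub:emailAddress", "?y2"),
    ("?x", "ub:telephone", "?y3"),
]

print(match_pattern(bgp[0], ("<Prof0>", "rdf:type", "ub:Professor")))
print(match_pattern(bgp[2], ("<Prof0>", "ub:name", '"Professor0"')))
```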


SPARQL BGP Processing with MapReduce

Two Operations

MR-Selection

– Extracts RDF triples which satisfy at least one triple pattern

MR-Join

– Merges selected triples



[Figure: example of MR-Selection followed by MR-Join for triple patterns 1 to 5]

Input RDF triples:

<Prof0> rdf:type ub:Professor
<Prof0> ub:worksFor <Dept0>
<Prof0> ub:name “Professor0”
<Prof0> ub:email “[email protected]”
<Prof0> ub:telephone “000-0000-0000”
<Dept0> rdf:type ub:Department
… … …

Output shown after MR-Selection and MR-Join:

ub:worksFor <Dept0>
ub:name “Professor0”
ub:email “[email protected]”
ub:telephone “000-0000-0000”


MR-Selection

public void map() {
    read a triple (s, p, o)
    // example, s: Prof0 p: rdf:type o: ub:Professor
    for each (triple pattern in a given query) {
        if (input triple satisfies a triple pattern) {
            make a key and a value
            // key = [x]Prof0 (variable name, value)
            // value = 1 (# of the satisfied triple pattern)
            output (key, value)
        }
    }
}

public void reduce() {
    read input from the map function
    // input format: (key, list(satisfied tp_numbers))
    for each (value in a list of tp_numbers) {
        make a key and a value
        // key = <1>x, value = [x]Prof0
        output (key, value)
    }
}
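
The MR-Selection pseudocode can be sketched in runnable form as follows. This is only an illustration under stated assumptions: keys are Python tuples rather than the string encodings `[x]Prof0` and `<1>x` used on the slide, a dictionary stands in for the shuffle phase, and `match`, `selection_map`, `selection_reduce` and `mr_selection` are hypothetical names. The sample triples are made up for the example.

```python
from collections import defaultdict

# Triple patterns 1..5 of the example query; terms starting with "?" are variables.
BGP = [
    ("?x", "rdf:type", "ub:Professor"),
    ("?x", "ub:worksFor", "<Department0>"),
    ("?x", "ub:name", "?y1"),
    ("?x", "ub:emailAddress", "?y2"),
    ("?x", "ub:telephone", "?y3"),
]

def match(pattern, triple):
    """Return ((variable, value), ...) if the triple satisfies the pattern, else None."""
    bindings = {}
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if bindings.setdefault(term[1:], value) != value:
                return None
        elif term != value:
            return None
    return tuple(bindings.items())

def selection_map(triple):
    """Emit (bindings, tp_number) for every triple pattern the triple satisfies."""
    for tp_number, pattern in enumerate(BGP, start=1):
        bindings = match(pattern, triple)
        if bindings is not None:
            yield bindings, tp_number  # slide notation: key = [x]Prof0, value = 1

def selection_reduce(bindings, tp_numbers):
    """Re-key each binding by its triple pattern number and variable names."""
    variables = tuple(name for name, _ in bindings)
    for tp_number in tp_numbers:
        yield (tp_number, variables), bindings  # slide notation: key = <1>x

def mr_selection(triples):
    grouped = defaultdict(list)  # stands in for the shuffle between map and reduce
    for triple in triples:
        for key, tp_number in selection_map(triple):
            grouped[key].append(tp_number)
    return [pair for key, tps in grouped.items() for pair in selection_reduce(key, tps)]

selected = mr_selection([
    ("<Prof0>", "rdf:type", "ub:Professor"),
    ("<Prof0>", "ub:worksFor", "<Department0>"),
    ("<Prof0>", "ub:name", '"Professor0"'),
    ("<Dept0>", "rdf:type", "ub:Department"),
])
for key, value in selected:
    print(key, value)
```

The `<Dept0>` triple satisfies no triple pattern, so it produces no output, which is the filtering effect MR-Selection is meant to have.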





MR-Selection

Conceptually, the MR-Selection algorithm produces temporary tables which satisfy each triple pattern

A result table has variable names, just as a relational table has attribute names

It also has values for the variable names, as does the relational table

The result table will be used for the next MR-Join operation if necessary


[Figure: one temporary table per triple pattern]

tp1: (x)    tp2: (x)    tp3: (x, y1)    tp4: (x, y2)    tp5: (x, y3)


MR-Join: Map

[Figure: the mapper keys each input by the value of the join-key variable, e.g. <Prof0> for the input <Prof0> ub:telephone “000-0000-0000”, with further keys such as <Prof1>]

BGP Analyzer

BGP Analyzer examines a given query before execution and provides join-keys to the map function

Join-key (shared variable): ?x
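
What the BGP Analyzer computes can be sketched as a small function that maps every shared variable to the numbers of the triple patterns using it. The slides do not show the analyzer's implementation, so `analyze_bgp` and its return format are assumptions made for this illustration.

```python
from collections import defaultdict

def analyze_bgp(bgp):
    """Map each shared variable to the numbers of the triple patterns that use it."""
    usage = defaultdict(list)
    for tp_number, pattern in enumerate(bgp, start=1):
        # dict.fromkeys removes repeated terms within one pattern, keeping order.
        for term in dict.fromkeys(pattern):
            if term.startswith("?"):
                usage[term[1:]].append(tp_number)
    # Only variables appearing in two or more triple patterns can act as join-keys.
    return {var: tps for var, tps in usage.items() if len(tps) > 1}

example_bgp = [
    ("?x", "rdf:type", "ub:Professor"),
    ("?x", "ub:worksFor", "<Department0>"),
    ("?x", "ub:name", "?y1"),
    ("?x", "ub:emailAddress", "?y2"),
    ("?x", "ub:telephone", "?y3"),
]
join_keys = analyze_bgp(example_bgp)
print(join_keys)
```

For the example query the only shared variable is `x`, related to triple patterns 1 to 5, which matches the join-key shown on the slide.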



MR-Join: Map

public void map() {
    read input from MR-Selection
    // example input (<1>x, [x]Prof0)
    // example input (<3>x|y1, [x]Prof0|[y1]Professor0)
    get join-key variables and corresponding tp_numbers to be joined from the BGP Analyzer
    // example join-key: x, tp_numbers=(1, 2, 3, 4, 5)
    for each (join-key determined by BGP Analyzer) {
        if (input is related to the join-key) {
            make a key and a value
            // key = [x]Prof0 (variable name, value)
            // value = <tp>1</tp>[x]Prof0 (# of the satisfied triple pattern, variable name, value)
            // value = <tp>3</tp>[x]Prof0|[y1]Professor0
            output (key, value)
        }
    }
}
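
A runnable sketch of the MR-Join map step is given below. It assumes that MR-Selection output is represented as `((tp_number, variables), bindings)` tuples and that the BGP Analyzer result is a dictionary from join-key variable to triple pattern numbers; both representations, and the name `join_map`, are choices made for this example rather than the encoding used on the slide.

```python
def join_map(record, join_keys):
    """Emit ((variable, value), (tp_number, bindings)) for each related join-key."""
    (tp_number, _variables), bindings = record
    row = dict(bindings)
    for variable, tp_numbers in join_keys.items():
        # The input is related to the join-key if its triple pattern is to be
        # joined on that variable and the row actually binds the variable.
        if tp_number in tp_numbers and variable in row:
            yield (variable, row[variable]), (tp_number, bindings)

# Assumed analyzer output for the example query.
join_keys = {"x": [1, 2, 3, 4, 5]}

# Two records as MR-Selection might produce them for <Prof0>.
records = [
    ((1, ("x",)), (("x", "<Prof0>"),)),
    ((3, ("x", "y1")), (("x", "<Prof0>"), ("y1", '"Professor0"'))),
]
mapped = [pair for record in records for pair in join_map(record, join_keys)]
for key, value in mapped:
    print(key, value)
```

Both records are emitted under the same key, so the framework would deliver them to the same reducer.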





MR-Join: Reduce

[Figure: the reducer checks the constraints for join-key variable x, <x> 1, 2, 3, 4, 5, and outputs the merged values ub:worksFor <Dept0>, ub:name “Professor0”, ub:email “[email protected]”, ub:telephone “000-0000-0000”]

BGP Analyzer

BGP Analyzer can provide triple pattern numbers related to the join-key variable by examining a given query


MR-Join: Reduce

public void reduce() {
    read input from the Map function
    // example input ([x]Prof0, [<tp>1</tp>[x]Prof0, <tp>3</tp>[x]Prof0|[y1]Professor0])
    get join-key variables and corresponding tp_numbers to be joined from the BGP Analyzer
    // example join-key: x, tp_numbers=(1, 2, 3, 4, 5)
    create a temporary hashtable H
    for each (value in values) {
        add an element to H
        // key = <1>x, value = [x]Prof0
        // key = <3>x|y1, value = [x]Prof0|[y1]Professor0
    } // H will be used for checking whether the input satisfies all related tps.
    if (keys in H cover all tp_numbers to be joined) {
        make a Cartesian product among values in H
        // (a1, b1), (a1, c1) => (a1, b1, c1)
        make a key and a value
        // key = <1|3>x|y1
        // value = [x]Prof0|[y1]Professor0
        output (key, value)
    }
}
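
The reduce step can be sketched in the same style. The hashtable H becomes a dictionary from triple pattern number to rows, and `itertools.product` forms the Cartesian product. The function name `join_reduce`, the tuple-based keys, and the extra consistency check on variables shared besides the join-key are assumptions of this sketch, not details taken from the slides.

```python
from itertools import product

def join_reduce(key, values, tp_numbers):
    """Join the rows grouped under one join-key value, if every tp is present."""
    table = {}  # plays the role of the temporary hashtable H
    for tp_number, bindings in values:
        table.setdefault(tp_number, []).append(dict(bindings))
    # Emit nothing unless the rows cover all triple patterns to be joined.
    if not all(tp in table for tp in tp_numbers):
        return
    for combo in product(*(table[tp] for tp in tp_numbers)):
        merged, consistent = {}, True
        for row in combo:
            for var, val in row.items():
                # Assumed check: a variable bound twice must agree on its value.
                if merged.setdefault(var, val) != val:
                    consistent = False
        if consistent:
            # slide notation: key = <1|3>x|y1, value = [x]Prof0|[y1]Professor0
            yield (tuple(tp_numbers), tuple(merged)), tuple(merged.items())

values = [
    (1, (("x", "<Prof0>"),)),
    (3, (("x", "<Prof0>"), ("y1", '"Professor0"'))),
]
joined = list(join_reduce(("x", "<Prof0>"), values, [1, 3]))
print(joined)
```

If the same values are reduced with `tp_numbers=[1, 2, 3]`, nothing is emitted, because no row for triple pattern 2 arrived under this key.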



Join-key Selection Strategies

BGP Analyzer provides join-key variables by analyzing a query

How to select join-key variables?

If a BGP has only one shared variable

– We can easily select the variable

If a BGP has two or more shared variables

– We applied two heuristics to select join-key variables

– Greedy Selection: select a join-key according to the number of related triple patterns

– Multiple Selection: select join-keys until every triple pattern participates in an MR-Join operation

Utilize the distributed and parallel system architecture
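
The two heuristics can be sketched as follows. The slides do not spell out the exact rules, so this is one possible reading: greedy selection picks the shared variable related to the most triple patterns, and multiple selection repeats that choice on the triple patterns not yet assigned, requiring at least two of them per join-key. The function names and the input format (variable to triple pattern numbers) are hypothetical.

```python
def greedy_selection(shared):
    """Pick the shared variable related to the largest number of triple patterns."""
    return max(shared, key=lambda var: len(shared[var]))

def multiple_selection(shared, all_tps):
    """Pick join-keys until every triple pattern that can be joined is assigned."""
    remaining = set(all_tps)
    plan = {}
    while remaining:
        candidates = {}
        for var, tps in shared.items():
            free = [tp for tp in tps if tp in remaining]
            # Assumption: a join needs at least two unassigned triple patterns.
            if var not in plan and len(free) >= 2:
                candidates[var] = free
        if not candidates:
            break  # leftover triple patterns wait for a later MR iteration
        variable = greedy_selection(candidates)
        plan[variable] = candidates[variable]
        remaining -= set(candidates[variable])
    return plan

shared = {"x": [1, 2, 3, 4, 5], "y2": [4, 6]}
print(greedy_selection(shared))
print(multiple_selection(shared, range(1, 7)))
print(multiple_selection({"a": [1, 2], "b": [3, 4], "c": [2, 3]}, range(1, 5)))
```

In the last call two join-keys are selected for the same iteration, so both joins could run in one MR job on different reducers.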



SPARQL BGP Processing with MR

Advantages

MapReduce can benefit from the multi-way join technique

– If triple patterns share a variable, MR can join them all at once

– It is not unusual that a BGP has several triple patterns sharing the same variable because RDF has a fixed simple data model
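
The benefit of the multi-way join can be illustrated by counting join rounds. The sketch below compares a cascade of 2-way hash joins with a single grouped multi-way join over per-triple-pattern tables. It runs in memory, treats one round as a stand-in for one MR job (an assumption of this example), and uses made-up rows; `two_way_join`, `cascade_join` and `multiway_join` are hypothetical names.

```python
from collections import defaultdict
from itertools import product

def two_way_join(left, right, key):
    """Hash join of two tables (lists of dicts) on one variable."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index[l[key]]]

def cascade_join(tables, key):
    """Join the tables pairwise; each pairwise join counts as one round."""
    result, rounds = tables[0], 0
    for table in tables[1:]:
        result = two_way_join(result, table, key)
        rounds += 1
    return result, rounds

def multiway_join(tables, key):
    """Group all tables by the join-key value and join them in a single round."""
    groups = defaultdict(lambda: [[] for _ in tables])
    for i, table in enumerate(tables):
        for row in table:
            groups[row[key]][i].append(row)
    result = []
    for parts in groups.values():
        # product() is empty when any table has no row for this key value.
        for combo in product(*parts):
            merged = {}
            for row in combo:
                merged.update(row)
            result.append(merged)
    return result, 1

tables = [
    [{"x": "<Prof0>"}, {"x": "<Prof1>"}],            # tp1
    [{"x": "<Prof0>"}],                              # tp2
    [{"x": "<Prof0>", "y1": '"Professor0"'}],        # tp3
    [{"x": "<Prof0>", "y2": '"mail0"'}],             # tp4 (made-up value)
    [{"x": "<Prof0>", "y3": '"000-0000-0000"'}],     # tp5
]
cascade_rows, cascade_rounds = cascade_join(tables, "x")
multi_rows, multi_rounds = multiway_join(tables, "x")
print(cascade_rounds, multi_rounds, cascade_rows == multi_rows)
```

Both strategies return the same rows, but the cascade needs four rounds for five tables while the multi-way join needs one.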



[Figure: (a) a cascade of 2-way joins over tp1 to tp5, producing (x), (x, y1), (x, y1, y2) and finally (x, y1, y2, y3); (b) a single multi-way join of tp1 to tp5, producing (x, y1, y2, y3) at once]


SPARQL BGP Processing with MR

Disadvantages

If we have two or more shared variables, we need expensive MR iterations

The triple patterns in a query cannot all be covered by a single variable

If we have two shared variables, MR iterations cannot be avoided

To reduce unnecessary MR iterations, join-key selection strategies should be applied


SELECT ?x ?y1 ?y2 ?y3
WHERE {
    ?x rdf:type ub:Professor .           # TP#1
    ?x ub:worksFor <Department0> .       # TP#2
    ?x ub:name ?y1 .                     # TP#3
    ?x ub:emailAddress ?y2 .             # TP#4
    ?x ub:telephone ?y3 .                # TP#5
    ?y2 ub:alias ?y4                     # TP#6
}

[Figure: tp1 to tp5 are joined into (x, y1, y2, y3); that result is then joined with tp6 (y2, y4) in a further iteration]
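
How the number of MR-Join iterations arises for this query can be sketched with a small planner. It assumes a purely greedy choice of one join-key per iteration and merges the joined tables into one intermediate table before the next iteration. This is an interpretation for illustration, with `plan_iterations` and its table representation being hypothetical.

```python
def plan_iterations(tables):
    """Plan MR-Join iterations.

    tables maps a frozenset of triple pattern numbers to the set of variables
    bound by that (temporary or intermediate) table.
    """
    tables = dict(tables)
    plan = []
    while len(tables) > 1:
        usage = {}
        for tps, variables in tables.items():
            for var in variables:
                usage.setdefault(var, []).append(tps)
        shared = {var: used for var, used in usage.items() if len(used) > 1}
        if not shared:
            break  # the remaining tables share no variable
        # Greedy choice: the variable shared by the most tables (ties by name).
        join_key = max(sorted(shared), key=lambda var: len(shared[var]))
        joined = shared[join_key]
        merged_tps = frozenset().union(*joined)
        merged_vars = set().union(*(tables.pop(tps) for tps in joined))
        tables[merged_tps] = merged_vars
        plan.append((join_key, sorted(sorted(tps) for tps in joined)))
    return plan

query_tables = {
    frozenset({1}): {"x"},
    frozenset({2}): {"x"},
    frozenset({3}): {"x", "y1"},
    frozenset({4}): {"x", "y2"},
    frozenset({5}): {"x", "y3"},
    frozenset({6}): {"y2", "y4"},
}
iterations = plan_iterations(query_tables)
for join_key, inputs in iterations:
    print(join_key, inputs)
```

Under these assumptions the six-pattern query needs two iterations: one on `x` covering triple patterns 1 to 5, and one on `y2` joining the intermediate result with triple pattern 6.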


Experiment

Environment

LUBM Dataset

Amazon EC2, Cloudera’s Hadoop Distribution, Amazon EBS

The effect of multi-way join

Multi-way join technique reduces the execution time by joining several triple patterns at once

Some queries do not show a significant difference because they are too simple to take advantage of multi-way join


Query   2-way     Multi-way   Diff.
Q1      123.391   86.423      36.968
Q2      181.583   104.035     77.548
Q3      69.773    67.214      2.559
Q4      256.591   126.474     130.117
Q5      75.533    74.163      1.37
Q6      44.198    44.526      -0.328
Q7      205.636   135.047     70.589
Q8      232.551   140.414     92.137
Q9      256.031   152.747     103.284
Q10     68.834    73.337      -4.503
Q11     66.834    63.557      3.277
Q12     112.802   86.117      26.685
Q13     73.369    72.825      0.544
Q14     47.092    42.156      4.936


Experiment

Scalability

As the number of machines increases, the average execution time decreases

– The MR algorithm creates a sufficient number of reducers, so we can utilize many machines

As we increase the data size, the execution time of the algorithm scales accordingly



Issues & Future Work – Indexing

Execution Time of MR-Selection and each MR-Join Iteration

MR-Selection can be a bottleneck because it takes about 40 seconds

The underlying storage structure is important

N-triple format -> HBase, Partitioning

Building an index needs a significant amount of loading time



Issues & Future Work – Pipelining

Hadoop’s MR implementation materializes intermediate results into the file system

This takes considerable time because of disk I/O

Pipelining

Allows data to be sent and received between tasks and between jobs without disk I/O

– Some implementations have become available

Hadoop Online Prototype (http://code.google.com/p/hop/)

CGL-MapReduce (eScience 2008)



Conclusion

There still remain many issues

This work is still in progress

Conclusion

RDF Data Warehouse using MapReduce

– RDF: Flexibility, Integration, Inference

– MapReduce: Scalability, Extensibility, Fault-tolerance

SPARQL Processing with MapReduce

– Synergy effects between RDF and MapReduce

– Issues

System Architecture

Loading(Indexing), Pipelining, Encoding, …

