22
Optimizing RDF Chain Queries using Genetic Algorithms DBDBD 2010 Alexander Hogenboom, Viorel Milea, Flavius Frasincar, and Uzay Kaymak Erasmus University Rotterdam {hogenboom, milea, frasincar, kaymak}@ese.eur.nl Full paper entitled “RCQ-GA: RDF Chain Query Optimization using Genetic Algorithms” appeared in Tenth International Conference on E- Commerce and Web Technologies (EC-Web 2009) , pages 181–192, 2009. November 22, 2010

Optimizing RDF Chain Queries using Genetic Algorithms

  • Upload
    ledell

  • View
    27

  • Download
    0

Embed Size (px)

DESCRIPTION

Optimizing RDF Chain Queries using Genetic Algorithms. Introduction (1). The application of Semantic Web technologies in an Electronic Commerce environment implies a need for good support tools - PowerPoint PPT Presentation

Citation preview

Page 1: Optimizing RDF Chain Queries using Genetic Algorithms

Optimizing RDF Chain Queries using Genetic Algorithms

DBDBD 2010

Alexander Hogenboom, Viorel Milea, Flavius Frasincar, and Uzay KaymakErasmus University Rotterdam

{hogenboom, milea, frasincar, kaymak}@ese.eur.nl

Full paper entitled “RCQ-GA: RDF Chain Query Optimization using Genetic Algorithms” appeared in Tenth International Conference on E-Commerce and Web

Technologies (EC-Web 2009), pages 181–192, 2009.

November 22, 2010

Page 2: Optimizing RDF Chain Queries using Genetic Algorithms

Introduction (1)

• The application of Semantic Web technologies in an Electronic Commerce environment implies a need for good support tools

• Fast query engines are needed for efficient querying of large amounts of data, usually represented using the Resource Description Framework (RDF)

DBDBD 2010

Page 3: Optimizing RDF Chain Queries using Genetic Algorithms

Introduction (2)

• Problem: optimizing query paths (the order in which different parts of a query are evaluated)

• Existing solution for Semantic Web: 2PO; two-phase optimization (Stuckenschmidt et al., IJWET 2005) :– Iterative Improvement– Simulated Annealing

• A genetic algorithm (GA) appears to be a feasible alternative

DBDBD 2010

Page 4: Optimizing RDF Chain Queries using Genetic Algorithms

RDF and Query Paths (1)

• RDF model is a collection of facts declared using RDF

• Facts are triples in the form of a node-arc-node link consisting of a subject, a predicate, and an object

• RDF sources can be queried using SPARQL

DBDBD 2010

Page 5: Optimizing RDF Chain Queries using Genetic Algorithms

RDF and Query Paths (2)

• We consider a subset of SPARQL queries: chain queries, where a query path is followed by performing joins between its subpaths of length 1

• Example RDF chain query:1. PREFIX c: <http://www.daml.org/2001/09/countries/fips#>2. PREFIX o: <http://www.daml.org/2003/09/factbook/factbook-ont#>3. SELECT ?partner4. WHERE { c:SouthAfrica o:importPartner ?impPartner .5. ?impPartner o:country ?partner .6. ?partner o:border ?border .7. ?border o:country ?neighbour .8. ?neighbour o:internationalDispute ?dispute .9. }

DBDBD 2010

Page 6: Optimizing RDF Chain Queries using Genetic Algorithms

RDF and Query Paths (3)

DBDBD 2010

Bushy query tree Right-deep query tree

Page 7: Optimizing RDF Chain Queries using Genetic Algorithms

RDF Query Path Optimization (1)

• Challenge: determine the right order in which the joins should be computed

• Optimize the overall response time

• Explore a solution space with query paths

• Solution space size exponential in number of concepts

DBDBD 2010

Page 8: Optimizing RDF Chain Queries using Genetic Algorithms

RDF Query Path Optimization (2)

• Solutions are associated with data transmission and processing costs

• Data processing costs are the sum of all join costs, which are influenced by the cardinalities of each operand and the join method used (nested loop)

• Neighbouring solutions in solution space can be identified using transformation rules introduced by Ioannidis and Kang (SIGMOD, 1990)

DBDBD 2010

Page 9: Optimizing RDF Chain Queries using Genetic Algorithms

RDF Query Path Optimization (3)

DBDBD 2010

Page 10: Optimizing RDF Chain Queries using Genetic Algorithms

RDF Query Path Optimization (4)

• Stuckenschmidt et al. (IJWET 2005) propose to use 2PO for RDF chain query optimization:

– Using Iterative Improvement (II), local optima are found by walking through solution space (from random starting points), while only taking steps yielding improvement in solution quality

– The best local optimum thus found is used as starting point for Simulated Annealing (SA); a walk through solution space is performed, where moves not yielding improvement are accepted with a declining probability

• We propose to optimize RDF chain queries using a GA, RCQ-GA

DBDBD 2010

Page 11: Optimizing RDF Chain Queries using Genetic Algorithms

RDF Query Path Optimization (5)

• In a GA, a population of chromosomes (solutions) is exposed to evolution: selection, crossovers, and mutations

• Typically, a GA finds good solutions faster than 2PO (Steinbrunn et al., VLDB 1997)

• A GA tends to spend a lot of time optimizing already good results before it terminates

DBDBD 2010

Page 12: Optimizing RDF Chain Queries using Genetic Algorithms

RDF Query Path Optimization (6)

• Starting point: the BushyGenetic (BG) algorithm proposed by Steinbrunn et al. (VLDB 1997) for traditional query path optimization

• We stimulate quicker convergence through elitist selection, fitness-based selection, a decreased population size, and tighter stopping conditions

• Solutions are encoded using an efficient ordinal number encoding scheme, facilitating easy crossover and mutation operations

DBDBD 2010

Page 13: Optimizing RDF Chain Queries using Genetic Algorithms

RDF Query Path Optimization (7)

• The algorithm iteratively joins two concepts in an ordered list of concepts

• Result is saved on position of first appearing concept

• Example:– (c1, c2, c3, c4): join 3 and 4

– (c1, c2, c3c4): join 1 and 2

– (c1c2, c3c4): join 1 and 2

– (c1c2c3c4)

• Encoded chromosome: ((3,4),(1,2),(1,2))DBDBD 2010

Page 14: Optimizing RDF Chain Queries using Genetic Algorithms

RDF Query Path Optimization (8)

DBDBD 2010

Page 15: Optimizing RDF Chain Queries using Genetic Algorithms

Performance (1)

• We benchmark execution times and solution quality of BG and our adaptation to RDF query environments, RCQ-GA, against those of 2PO

• The effects of a time limit (1 second) on 2PO and RCQ-GA are also assessed

• The entire solution space is considered (i.e., bushy query trees are valid options)

• Each algorithm is tested on chain queries varying in length from 2 to 20 predicates

• Each experiment is iterated 100 timesDBDBD 2010

Page 16: Optimizing RDF Chain Queries using Genetic Algorithms

Performance (2)

DBDBD 2010

Relative deviation of average execution times from 2PO average

Page 17: Optimizing RDF Chain Queries using Genetic Algorithms

Performance (3)

DBDBD 2010

Relative deviation of average solution costs from 2PO average

Page 18: Optimizing RDF Chain Queries using Genetic Algorithms

Performance (4)

DBDBD 2010

Relative deviation of coefficients of variation of solution costs from 2PO average

Page 19: Optimizing RDF Chain Queries using Genetic Algorithms

Conclusions

• Performance of a GA compared to 2PO is positively correlated with the complexity of the solution space and the restrictiveness of the RDF environment

• An appropriately configured GA can outperform 2PO in solution quality, execution time needed, and consistency of solution quality

DBDBD 2010

Page 20: Optimizing RDF Chain Queries using Genetic Algorithms

Future Work

• Optimize parameters (e.g., using meta-algorithms)

• Evaluate performance in a distributed setting

• Experiment with other algorithms, such as ant colony optimization or particle swarm optimization

DBDBD 2010

Page 21: Optimizing RDF Chain Queries using Genetic Algorithms

Questions?

• Feel free to contact:Alexander HogenboomErasmus School of EconomicsErasmus University RotterdamP.O. Box 1738, 3000 DR, The [email protected]

DBDBD 2010

Page 22: Optimizing RDF Chain Queries using Genetic Algorithms

References

• Y.E. Ioannidis and Y.C. Kang. Randomized Algorithms for Optimizing Large Join Queries. In The 1990 ACM SIGMOD International Conference on Management of Data (SIGMOD 1990), pp. 312–321, New York, NY, USA, 1990.

• M. Steinbrunn, G. Moerkotte, and A. Kemper. Heuristic and Randomized Optimization for the Join Ordering Problem. The VLDB Journal, 6(3):191–208, 1997.

• H. Stuckenschmidt, R. Vdovjak, J. Broekstra, and G.J. Houben. Towards Distributed Processing of RDF Path Queries. International Journal of Web Engineering and Technology (IJWET), 2(2-3):207–230, 2005.

DBDBD 2010