
Advanced Indexing and Query Processing

for Multidimensional Databases

Dissertation zur Erlangung des Doktorgrades der Naturwissenschaften

(Dr. rer. nat.)

dem Fachbereich Mathematik und Informatik der Philipps-Universität Marburg vorgelegt von

Evangelos Dellis

aus Trikala

Marburg/Lahn 2007


Vom Fachbereich Mathematik und Informatik der Philipps-Universität Marburg als Dissertation am 3. September 2007 angenommen.

Erstgutachter: Prof. Dr. Bernhard Seeger
Zweitgutachter: Prof. Dr. Yannis Theodoridis (University of Piraeus)

Tag der mündlichen Prüfung am 7. September 2007.


Acknowledgement

I would like to express my thanks to all the people who supported me during the past years while I have been working on this thesis. I extend my warmest thanks to my supervisor, Professor Dr. Bernhard Seeger. He took particular care to maintain a good working atmosphere within the group and to provide a supportive and inspiring environment. I am grateful to Prof. Dr. Yannis Theodoridis, who was readily willing to act as the second referee to this work. I would also like to thank Dr. Jörg Kämper, Max Planck Institute for Terrestrial Microbiology, Marburg, Germany, and Dr. Georgios Paliouras, National Center for Scientific Research 'Demokritos', Athens, Greece, for their financial support. This work could not have grown and matured without the discussions with my colleagues. In particular I would like to thank my colleagues Ilya Vladimirskiy and especially Akrivi Vlachou, who inspired me to a great extent. Other fruitful discussions which brought this work forward took place with (in alphabetical order): Michael Cammert, Christoph Heinz, Jürgen Krämer, Tobias Riemenschneider and Sony Vaupel. I thank them all. I also appreciate the substantial help of the students whose study thesis or diploma thesis I supervised, including Yuchen Deng, Liang Dong, Tai Lin, Huimin Yang and Liping Yu. Particular thanks go to Oliver Dippel and Mechthild Kessler. Finally, I would like to thank my family and friends, who constantly supported me during the development of this thesis, especially my parents, who always supported my career and encouraged me to find my way.

Evangelos Dellis

Marburg, July 2007.


Abstract

Many new applications, such as multimedia databases, employ the so-called feature transformation, which transforms important features or properties of data objects into high-dimensional points. Searching for 'similar' or 'non-dominated' objects based on these features is thus a search for points in this feature space. To support efficient query processing in these high-dimensional databases, high-dimensional indexes are required to prune the search space, and efficient query processing strategies employing these indexes have to be designed. In order to manage the huge volume of such complex data, advanced database systems are employed. In contrast to conventional database systems that support exact-match queries, the user of these advanced database systems focuses on applying similarity search and skyline retrieval techniques.

Based on an analysis of typical advanced database systems, such as multimedia databases, electronic market places, and decision support systems, the following four challenging characteristics of complexity are identified: the high-dimensional nature of the data, the re-usability of existing index structures, novel (more expressive) query operators for advanced database systems, and the efficient analysis of complex high-dimensional data. Therefore, the general goal of this thesis is the improvement of the efficiency of index-based query processing in high-dimensional data spaces and the development of novel query operators.

The first part of this thesis deals with similarity query processing techniques. We introduce a new approach to indexing multidimensional data that is particularly suitable for the efficient incremental processing of nearest neighbor queries. The basic idea is to split the data space vertically into multiple low- and medium-dimensional data spaces. The data from each of these lower-dimensional subspaces is organized by using a standard multi-dimensional index structure. In order to perform incremental NN-queries on top of this index-striping efficiently, we first develop an algorithm for merging the results received from the underlying indexes. Then, an accurate cost model relying on a power law is presented that determines an appropriate number of indexes. Moreover, we consider the problem of dimension assignment, where each dimension is assigned to a lower-dimensional subspace, such that the cost of nearest neighbor queries is minimized. Furthermore, a generalization of the iDistance technique, called Multidimensional iDistance (MiD), for k-nearest neighbor query processing is presented. Three main steps are performed for building MiD. In agreement with iDistance, firstly, data points are partitioned into clusters and, secondly, a reference point is determined for every cluster. However, the third step substantially differs from iDistance, as a data object is mapped to an m-dimensional distance vector, where m > 1 generally holds. The m dimensions are generated by splitting the original data space into m subspaces and computing the partial distance between the object and the reference point for every subspace. The resulting m-dimensional points can be indexed by an arbitrary point access method like an R-tree. Our crucial parameter m is derived from a cost model that is based on a power law. We present range and k-NN query processing algorithms for MiD.

The second part of this thesis deals with skyline query processing techniques. We first introduce the problem of Constrained Subspace Skyline Queries (CSSQ). We present a query processing algorithm which builds on multiple low-dimensional index structures. Due to the usage of well-performing low-dimensional indexes, constrained subspace skyline queries for arbitrarily large subspaces are efficiently supported. Effective pruning strategies are applied to discard points from dominated regions. An important ingredient of our approach is the workload-adaptive strategy for determining the number of indexes and the assignment of dimensions to the indexes. Furthermore, we introduce the concept of Reverse Skyline Queries (RSQ). Given a set of data points P and a query point q, an RSQ returns the data objects that have the query object in their 'dynamic' skyline. Such a dynamic skyline corresponds to the skyline of a transformed data space where point q becomes the origin and all points are represented by their distance to q. In order to compute the reverse skyline of an arbitrary query point, we first propose a branch-and-bound algorithm (called BBRS), which is an improved customization of the original BBS algorithm. To further reduce the computational cost of determining whether a point belongs to the reverse skyline, we propose an enhanced algorithm (called RSSA) that is based on accurate pre-computed approximations of the skylines. These approximations are used to identify whether a point belongs to the reverse skyline or not.

The effectiveness and efficiency of all proposed techniques are discussed and verified by comparison with conventional approaches in versatile experimental evaluations on real-world datasets.


Abstract (in German)

Eine weit verbreitete Technik in vielen neuen Anwendungen, wie z. B. Multimedia-Datenbanken, ist die sogenannte Feature-Transformation, bei der wichtige Eigenschaften der Datenobjekte in Punkte eines mehrdimensionalen Vektorraums, die sogenannten Feature-Vektoren, überführt werden. Die Suche nach 'ähnlichen' oder 'nicht-dominierten' Objekten, die auf diesen Feature-Vektoren basiert, ist realisiert als Suche nach Punkten in diesem mehrdimensionalen Vektorraum. Damit eine effiziente Anfragebearbeitung in diesen hoch-dimensionalen Datenbanken unterstützt werden kann, sind hoch-dimensionale Indexstrukturen notwendig, die den Suchraum beschränken, und es müssen effiziente Anfragebearbeitungsstrategien, die diese Indexe verwenden, entwickelt werden. Bei der effektiven Verwaltung dieser großen Menge von komplexen Daten kommen Datenbanksysteme für Nicht-Standard-Anwendungen zum Einsatz. Im Gegensatz zu herkömmlichen Datenbanksystemen, die 'exact-match' Anfragen unterstützen, fokussiert der Benutzer dieser Nicht-Standard-Datenbanksysteme auf Ähnlichkeitssuche und sogenannte Skyline-Techniken.

Ausgehend von einer Analyse der typischen hochentwickelten Datenbanksysteme, wie Multimedia-Datenbanksysteme, elektronische Marktplätze und Entscheidungsunterstützungssysteme, werden die folgenden vier grundlegenden Charakteristika festgestellt: hoch-dimensionale Natur der Daten, Wiederverwendbarkeit von existierenden Indexstrukturen, neue (ausdrucksstärkere) Anfragetypen für Nicht-Standard-Datenbanksysteme und effiziente Analyse von komplexen hoch-dimensionalen Daten. Das Hauptziel der vorliegenden Arbeit ist deshalb die Verbesserung der Performanz bei der indexbasierten Anfragebearbeitung in hochdimensionalen Datenräumen und die Entwicklung neuer Anfrageoperatoren.

Der erste Teil der Arbeit beschäftigt sich mit Methoden der Ähnlichkeitssuche. Wir präsentieren eine neue Indexierungstechnik für mehrdimensionale Daten, die besonders geeignet ist für die inkrementelle Bearbeitung von Nächste-Nachbarn (NN) Anfragen. Die Daten werden zuerst auf mehrere niedrigdimensionale Räume verteilt und dann mit mehrdimensionalen Standard-Indexstrukturen indexiert. Damit wir inkrementelle NN-Anfragen effizient unterstützen können, wird zuerst ein Algorithmus entwickelt, der verantwortlich ist für das Verschmelzen von Resultaten der verschiedenen Indexe. Danach wird ein Kostenmodell präsentiert, das auf dem Prinzip 'power law' basiert und die Anzahl der Indexe voraussagt. Ein neuer Algorithmus wird vorgeschlagen, der die Dimensionen auf die verschiedenen Indexe verteilt. Weiterhin wird eine Generalisierung der 'state-of-the-art' iDistance-Technik präsentiert, die Multidimensional iDistance (MiD) genannt wird. Die Daten werden zuerst in Cluster verteilt und ein Referenzpunkt für jedes Cluster definiert. Dann wird jedes Datenobjekt in einen m-dimensionalen Distanzvektor überführt (generell gilt m > 1). Diese m Dimensionen werden generiert, indem man den originalen Datenraum in m Unterräume verteilt und die partiellen Distanzen zwischen den Objekten und den Referenzpunkten für jeden Unterraum kalkuliert. Die m-dimensionalen Punkte werden dann mit herkömmlichen mehrdimensionalen Indexstrukturen indexiert. Ein Kostenmodell wird für die Berechnung des Parameters m verwendet. Effiziente Algorithmen für Bereichs- und Nächste-Nachbarn-Anfragen werden zudem präsentiert.

Der zweite Teil der Arbeit befasst sich mit Skyline-Anfragebearbeitungstechniken. Zuerst wird das Konzept von Constrained Subspace Skyline Anfragen (CSSA) definiert. Danach präsentieren wir einen Algorithmus für die Anfragebearbeitung, der auf mehreren niedrig-dimensionalen Indexstrukturen basiert. Da wir niedrig-dimensionale Standard-Indexstrukturen benutzen, können Constrained Subspace Skyline Anfragen effizient unterstützt werden. Weiterhin werden Strategien zur Beschränkung des Suchraums entwickelt. Ein sehr wichtiger Bestandteil unserer Technik ist die workload-adaptive Strategie zur Bestimmung der passenden Anzahl der Indexe und der Zuteilung der Dimensionen zu den Indexen. Weiterhin definieren wir das Konzept von Reverse Skyline Anfragen (RSA). Für einen gegebenen Datensatz P und einen Anfragepunkt q liefert RSA die Punkte, die den Anfragepunkt als Teil der 'dynamischen' Skyline haben. Diese dynamische Skyline wird generiert, indem man alle Punkte bezüglich der Distanz zu einem Referenzpunkt transformiert. Wir präsentieren einen Branch-and-Bound-Algorithmus (BBRS-Algorithmus) für Reverse Skyline Anfragen. Damit die Kosten für die Berechnung der Reverse Skyline sinken, präsentieren wir zusätzlich einen verbesserten Ansatz (RSSA-Algorithmus), der auf vorberechneten Approximationen von Skylines basiert. Diese Approximationen werden benutzt, um zu entscheiden, ob ein Punkt zur Reverse Skyline gehört oder nicht.

Effizienz und Effektivität aller vorgeschlagenen Verfahren werden ausführlich diskutiert und durch experimentelle Vergleiche mit herkömmlichen Methoden auf Daten aus realen Anwendungen verifiziert.


Contents

Part I Preliminaries

1 Introduction  . . .  3
  1.1 Database Systems for Advanced Applications  . . .  4
    1.1.1 Content-based Retrieval Systems  . . .  4
    1.1.2 Electronic Market Places  . . .  5
    1.1.3 Decision Support Applications  . . .  7
  1.2 Similarity Search in Database Systems  . . .  8
    1.2.1 Basic Definitions  . . .  8
    1.2.2 Similarity Query Types  . . .  9
    1.2.3 Query Processing  . . .  10
  1.3 Multi-criteria Decision Making in Database Systems  . . .  11
    1.3.1 The Skyline Operator  . . .  11
    1.3.2 Skyline Queries - SQL Extension  . . .  13
    1.3.3 Variations of Skyline Queries  . . .  13
  1.4 Advanced Database Systems: Optimization Strategies for Similarity and Skyline Queries  . . .  14
  1.5 Outline of the Thesis  . . .  15

2 Query Processing in Multidimensional Spaces  . . .  21
  2.1 Similarity Query Processing  . . .  22
    2.1.1 Multi-Step Query Processing  . . .  22
    2.1.2 Multi-dimensional Index Structures  . . .  25
    2.1.3 Transformation into Low Dimensional Space  . . .  28
    2.1.4 Transformation into One Dimensional Space  . . .  31
    2.1.5 Scan-based Access Methods - Quantization  . . .  34
    2.1.6 Metric-based Access Methods  . . .  35
    2.1.7 Summary  . . .  36
  2.2 Skyline Query Processing  . . .  37
    2.2.1 Generic Skyline Algorithms  . . .  38
    2.2.2 Index-based Skyline Algorithms  . . .  40
    2.2.3 Subspace Skyline Algorithms  . . .  42
    2.2.4 High-Dimensional Skylines  . . .  44
    2.2.5 Distributed and P2P Skyline Algorithms  . . .  45
    2.2.6 Skyline Algorithms in other Environments  . . .  48
    2.2.7 Summary  . . .  53
  2.3 Discussion  . . .  54

Part II Similarity Query Optimization

3 Nearest Neighbor Search on Vertically Partitioned High Dimensional Data  . . .  59
  3.1 Introduction  . . .  60
  3.2 Incremental Nearest Neighbor Algorithms  . . .  62
    3.2.1 Best-Fit  . . .  63
    3.2.2 TA-Index  . . .  63
    3.2.3 TA-Index+  . . .  65
    3.2.4 Analysis and Correctness  . . .  65
    3.2.5 Index Scheduling  . . .  67
    3.2.6 Discussion  . . .  68
  3.3 A Cost Model Based on Power Law  . . .  68
    3.3.1 Overview of Existing Cost Models  . . .  69
    3.3.2 The Proposed Model  . . .  69
  3.4 Dimension Assignment Algorithms  . . .  72
    3.4.1 Linear Greedy Algorithm  . . .  73
    3.4.2 Quadratic Greedy Algorithm  . . .  73
    3.4.3 Non-Equal Greedy Algorithm  . . .  74
  3.5 NN Search Using External Priority Queues  . . .  75
    3.5.1 Overview of External Priority Queues  . . .  75
    3.5.2 External Priority Queues within Index-Striping  . . .  76
    3.5.3 Discussion  . . .  80
  3.6 Experimental Validation  . . .  80
    3.6.1 Evaluation of the Cost Model  . . .  81
    3.6.2 Examination of Attribute Assignment  . . .  81
    3.6.3 Comparison of our k-NN algorithms  . . .  84
    3.6.4 Comparative Study of Index-Striping and other Techniques  . . .  85
  3.7 Conclusion  . . .  87

4 MiD: a Dimension Adaptive Indexing Method for Efficient Similarity Search  . . .  89
  4.1 Introduction  . . .  89
  4.2 Overview of the iDistance Technique  . . .  91
    4.2.1 The Data Structure  . . .  91
    4.2.2 Query Processing  . . .  92
    4.2.3 Motivation  . . .  92
  4.3 The MiD Technique  . . .  93
    4.3.1 Data Transformation  . . .  94
    4.3.2 Dimension Assignment  . . .  95
    4.3.3 Index Construction  . . .  96
  4.4 Query Processing  . . .  97
    4.4.1 Range Search Algorithm  . . .  97
    4.4.2 k-NN Search Algorithm  . . .  98
    4.4.3 Incremental k-NN Search Algorithm  . . .  99
  4.5 Theoretical Analysis  . . .  100
    4.5.1 Preliminaries  . . .  101
    4.5.2 Estimating the Number of Node Accesses  . . .  102
    4.5.3 Estimating the Parameter m  . . .  102
  4.6 Experimental Evaluation  . . .  103
    4.6.1 Effect of the Number of Dimensions  . . .  103
    4.6.2 Effect of the Dataset Size  . . .  104
    4.6.3 Performance on k-NN Search  . . .  105
    4.6.4 Effect of the Parameter m  . . .  106
  4.7 Conclusions  . . .  107

Part III Skyline Query Optimization

5 Constrained Subspace Skyline Computation  . . .  111
  5.1 Introduction  . . .  111
  5.2 Constrained Subspace Skyline Computation for Uniform Data  . . .  114
    5.2.1 Preliminaries  . . .  115
    5.2.2 Pruning with the Nearest Neighbor  . . .  116
    5.2.3 STA: A Threshold-based Skyline Algorithm  . . .  117
    5.2.4 Discussion  . . .  118
  5.3 Improved Pruning for Arbitrary Distributions  . . .  119
    5.3.1 Motivating Example  . . .  119
    5.3.2 Improved Pruning using Multiple Points  . . .  122
    5.3.3 Scheduler Adaptation  . . .  124
  5.4 Adaptive Subspace Indexing using Low Dimensional R-trees  . . .  124
    5.4.1 Number of Indexes  . . .  125
    5.4.2 Attribute Assignment Algorithm  . . .  127
    5.4.3 Workload-adaptive Extension  . . .  128
  5.5 Performance Evaluation  . . .  129
    5.5.1 Examination of Constrained Subspace Skylines  . . .  129
    5.5.2 Scalability with the Dataset Cardinality and Full-space Dimensionality  . . .  131
    5.5.3 Adaptation to the Query Workload  . . .  132
  5.6 Conclusion  . . .  132

6 Efficient Computation of Reverse Skyline Queries  . . .  135
  6.1 Introduction  . . .  136
  6.2 Related Work  . . .  138
    6.2.1 Skyline Query Processing  . . .  138
    6.2.2 Reverse NN Search  . . .  139
  6.3 Problem Definition  . . .  141
    6.3.1 Dynamic Skyline Query  . . .  141
    6.3.2 Reverse Skyline Query  . . .  143
  6.4 Branch and Bound Processing of Reverse Skyline Queries  . . .  143
    6.4.1 Preliminaries  . . .  143
    6.4.2 Description of BBRS  . . .  145
    6.4.3 Analysis of BBRS  . . .  146
  6.5 Reverse Skyline Computation using Skyline Approximations  . . .  148
    6.5.1 Basic Observations  . . .  148
    6.5.2 Description of the Algorithm  . . .  150
    6.5.3 Updates  . . .  152
  6.6 Approximation of Skylines  . . .  153
    6.6.1 Optimal Approximation for 2-dimensional Skylines  . . .  154
    6.6.2 A Greedy Algorithm for d-dimensional Skylines  . . .  154
  6.7 Experimental Evaluation  . . .  156
    6.7.1 Tuning the Skyline Approximation  . . .  156
    6.7.2 Examination of Reverse Skyline Queries  . . .  157
    6.7.3 Scalability with the Dimensionality and Dataset Cardinality  . . .  158
  6.8 Conclusions  . . .  160

Part IV Conclusions and Outlook

7 Summary of Contributions  . . .  163
  7.1 Preliminaries (Part I)  . . .  163
  7.2 Similarity Query Optimization (Part II)  . . .  164
  7.3 Skyline Query Optimization (Part III)  . . .  164

8 Discussion on Future Work  . . .  167

References  . . .  173


Part I

Preliminaries


1

Introduction

Database systems are key components of today's information technology infrastructure. With the enormous growth of this infrastructure in the past decade, new challenges for database systems have arisen. In both science and industry, new applications of database systems have been developed, and their importance in practice is rapidly increasing. In this chapter, we discuss some of the new challenges for database systems, present our approach to tackle these challenges and outline the scope of this thesis. Furthermore, we introduce the basic concepts behind our approach and describe some example applications which are repeatedly used in the following chapters.

Advanced database systems require novel, application-oriented techniques for efficient storage, retrieval and management of large amounts of data. Furthermore, the rapidly growing, tremendous amount of data exceeds the human ability to comprehend, overview and analyze the complex information stored in these advanced databases. As a result, similarity search and multi-criteria optimization techniques have become more and more popular in the past decade. In contrast to the conventional exact-match queries usual for traditional database systems, similarity search finds data objects that differ only slightly from a given query object. Multi-criteria optimization refers to optimization with regard to multiple objective functions, aiming at a simultaneous improvement of the objectives. The goals are usually conflicting, so that an optimal solution in the conventional sense does not exist.

Many new applications, such as multimedia databases, employ the so-called feature transformation, which transforms important features or properties of data objects into high-dimensional points. Searching for objects based on these features is thus a search for points in this feature space. To support efficient retrieval in these high-dimensional databases, indexes are required to prune the search space. Indexes for low-dimensional databases such as spatial databases are well studied. Most of these application-specific indexes do not scale with the number of dimensions, and they are not designed to support advanced database queries, such as similarity and skyline queries. In fact, they suffer from what has been termed the dimensionality curse, and the degradation in performance is so bad that sequential scanning is making a return as a more efficient alternative. Multidimensional indexing has been examined extensively in the past; see [52] for a survey of various techniques. The performance of these algorithms highly depends on the quality of the underlying index. The most serious problem of multi-dimensional index structures is that they are not able to cope with a high number of dimensions. This disadvantage can be alleviated by applying dimensionality reduction techniques. However, the reduced dimensionality often still remains too high for most common index structures like R-trees.

The remainder of this chapter is organized as follows. First, we consider in Section 1.1 several advanced database systems and the characteristics of data objects that are typical for these database systems from the application point of view. Section 1.2 and Section 1.3 introduce the basics of similarity search and multi-criteria optimization. Based on the demands of the advanced database systems, we elaborate in Section 1.4 the new challenges for similarity and multi-criteria query processing for advanced database systems. An outline of the techniques proposed in this thesis is presented in Section 1.5.

1.1 Database Systems for Advanced Applications

This section aims at surveying database systems that have been established for advanced applications in the past decade. We briefly sketch applications of similarity search in database systems, such as the search for similar color images in a multimedia database and the search for similar geometric shapes, as required in CAD databases. Furthermore, we present several applications common in decision support systems, such as profile-based marketing and electronic market places.

1.1.1 Content-based Retrieval Systems

In content-based retrieval, the content of an image (such as objects, color, texture, and shape) forms the basis for retrieval. The task of content-based image similarity search is to find all similar or the most similar images in the database relative to a given query image. Automatic extraction of content enables retrieval and manipulation of images based on their contents. A natural way to search for images in a multimedia database is based on color distributions. Two color images are defined to be similar if they contain approximately the same colors. This is formalized by means of a color histogram. After accordingly reducing and normalizing the color spectrum of the images to a manageable number of different colors, the images are analyzed. For each color, the ratio of pixels is determined which are correspondingly colored.

Due to the large size of images and the large quantity of images, efficient and effective indexing mechanisms are necessary to facilitate speedy searching. To facilitate content-based retrieval, the general approach adopted in the literature has been to transform the content information into a form that can be supported by an indexing method. An obvious way to compare color histograms is, again, to interpret them as vectors in Euclidean space. A useful and increasingly common approach to indexing these images based on their contents is to associate their characteristics with points in a multi-dimensional feature space. Each feature vector thus consists of d values, which correspond to coordinates in a d-dimensional space (Figure 1.1).

Fig. 1.1. Similarity Search in Image Databases.

Another application domain for advanced database systems is the area of Computer Aided Design (CAD). The demand for managing terabytes of data leads to the usage of such advanced database systems. To retrieve and analyze CAD data, each object is usually mapped to a feature vector. There are several methods that extract feature values from 3-dimensional CAD objects; e.g., [27] gives an overview of these feature extraction methods.

Most current CAD systems are file-based systems which do not take advantage of any database technology. Some modern CAD systems currently use object-relational or object-oriented database technology, but only simple operations are supported, such as object retrieval according to the key. Database systems are in this case merely used as storage managers supporting data independence, concurrency and recovery, but not to support the user with a powerful search tool.

Other applications that require similarity search support include molecular biology [3], medical imaging [88], financial databases and knowledge discovery in general.

1.1.2 Electronic Market Places

Consider Yahoo! Autos (http://autos.yahoo.com), an Internet car broker. Assume that a user who is interested in buying a car registers with a certain profile. This profile contains a selection of criteria the 'new' car should have, for example, cheap cars that have not too many miles, or cheap cars that are fast; additionally, the user is able to further narrow the search space, for example by defining criteria like with air-conditioning, with power locks, or with airbags. Table 1.1 shows an example of the current database of available cars. For illustration purposes we have left out any additional properties of the cars.

Table 1.1. Used car market: offers

Make     Price       Mileage   Horse Power
BMW      EUR 30,000  50,000    200 HP
Ford     EUR  8,000  30,000    150 HP
Toyota   EUR 10,000  40,000    170 HP

Now let us consider the preferences of two users. User 1 is interested in cheap cars (i.e., price (min)) and cars of low mileage (i.e., mileage (min)); User 2 is interested in cheap cars (i.e., price (min)) that have a high speed (i.e., horse power (max)). For User 1 only the Ford is interesting, since both the BMW and the Toyota are worse in terms of price and mileage than the Ford. For User 2 all three cars are of interest (i.e., BMW, Ford and Toyota). These initially interesting cars represent the skyline of the current car database. Each user has a different skyline.

Both users do not want to buy their car right away, but they both want to stay informed when new cars are offered or cars of their interest are sold. A new offer that is registered in the database might be the following: (VW, EUR 12,000, 20,000, 160 HP). User 1 must be informed about this new offer, since it is interesting in terms of price and mileage: it has a lower mileage than the Ford and is not dominated by it. However, User 2 need not be informed about this new offer, because the VW is worse in terms of both price and speed compared to the Toyota. This way, the skyline model serves as an information filter in this application scenario. Now let us assume that the Ford is sold to another user. Of course, both User 1 and User 2 must be informed, because the Ford is no longer available and that offer was interesting for both users. In addition, User 1 should be informed about the Toyota, because the Toyota has now become an interesting offer for this user.

Another application of electronic market places is the following. A person looking to buy a notebook at a website may care about a large number of features like CPU, memory, hard disk, weight, price, size, screen resolution, brand and other things specific to his/her choice. There will be too many notebooks for a person to examine manually. In such cases, computing the skyline over the desired notebook features may remove a large number of notebooks whose features are not better than those in the skyline, thus leaving a manageable number of notebooks for evaluation.

To illustrate this example, consider a set of six notebook models as shown in Table 1.2, where the first three are produced by manufacturer FSC and the next three by manufacturer IBM. If we consider only their weight and price attributes, which are better if minimized, then the skyline consists of FSC2, FSC3, IBM1 and IBM2, with the two notebooks FSC1 and IBM3 being dominated by the competitor's IBM1 and FSC2, respectively. In this case, regardless of the weights assigned by the customers, the highest-scoring notebook can only come from FSC2, FSC3, IBM1 and IBM2. The same concept can easily be extended to more attributes such as CPU speed, memory size, etc., where a higher value is better.

Table 1.2. Notebook Configuration

Model  CPU      Memory   Harddisk  Weight  Price
FSC1   2.0 GHz  1024 MB  40 GB     2.0 Kg  EUR 2200
FSC2   1.9 GHz   256 MB  60 GB     1.6 Kg  EUR 2400
FSC3   1.9 GHz   512 MB  60 GB     2.2 Kg  EUR 1900
IBM1   1.8 GHz   512 MB  40 GB     1.9 Kg  EUR 2100
IBM2   1.9 GHz  1024 MB  40 GB     2.8 Kg  EUR 1800
IBM3   1.8 GHz   768 MB  50 GB     1.7 Kg  EUR 2550

1.1.3 Decision Support Applications

Consider a person who is looking for top-ranked movies based on the ratings given by other users on a particular movie website. In such cases, the rating of each user corresponds to a dimension in the data set, and, given a large number of users, the data set obviously becomes a high-dimensional one. The skyline of the data set will contain only top-rated movies, while low-ranked movies will be ignored and not become part of the skyline.

Another scenario is that of a database that records the performance of players who have played in a sports discipline during a particular period of time. For example, consider the official NBA website (http://www.nba.com), where the great NBA players' technical statistics from 1960 to 2003 are kept. Each dimension represents a certain 'skill', e.g., games played, number of 3-pointers, number of rebounds, number of blocks, number of fouls, total points, total rebounds, total assists and so on. Obviously, the dataset is high-dimensional and may contain many tuples, each reflecting a player's 'performance' for a certain year. Note that every player has a tuple for every year he played, so it is possible to have several tuples for one player with different year numbers, like 'Michael Jordan in 1986' and 'Michael Jordan in 1999'. Finding the skyline in this players' statistics data set makes excellent sense in practice. People are often interested in finding the skyline players - players who have some outstanding merits that are not dominated by those of other players.

We note that in both the above applications, ranking can be done by providing some preference functions and requesting users to provide some weight assignments for their preferred features, or for more trusted users in the latter case. However, providing such weight assignments for a large number of dimensions is not always easy without any initial knowledge about the data.

1.2 Similarity Search in Database Systems

During the last decades, an increasing number of applications, such as those mentioned in the previous subsection, have emerged where a large amount of high-dimensional data points have to be processed. Instead of exact-match queries, these applications require efficient support for similarity queries. Among these similarity queries, the k-nearest neighbor query (k-NN query), which delivers the k nearest points to a query point, is of particular importance for the applications mentioned above.

1.2.1 Basic Definitions

The search for database objects similar to a given query object is typically performed by following a feature-based approach. The basic idea is to extract important properties from the original data objects in the database and to transform them into vectors of a d-dimensional vector space, the so-called feature vectors; in most cases this results in a high-dimensional space. That means features are represented as points in multi-dimensional databases, and retrieval requires complex distance functions to quantify the similarities of multi-dimensional features. Usually, the feature transformation is defined such that the distance between the feature vectors (the feature distance) either corresponds to the object distance or is at least a lower bound thereof ('lower bounding property'). This way, the similarity search is naturally translated into a range query on the feature space.

The feature transformation is usually provided by an expert in the corresponding application domain, as it has to capture the most important and most distinguishing properties of the objects in order to achieve a good performance in query processing. Each similarity measure assigns a positive value to a pair of objects saying how dissimilar they are. Usually, the similarity measure is equal to 0 if and only if the two objects are identical, and the higher its value, the less similar the two objects are. Therefore, it is also called the object distance. In all applications mentioned above, the similarity measure forms a metric, because it is positive, symmetric, and fulfills the triangle inequality.

Depending on the application to be supported, several metrics are applied to define the distances. The most commonly used metric is the Euclidean metric, also known as the L2 metric. Moreover, two other metrics are widely applied: the Manhattan metric L1 (or city block metric) and the maximum metric L∞.
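To make these metrics concrete, the following minimal Python sketch (our own illustration, not code from this thesis) computes the Lt distances; the optional per-dimension weights anticipate the weighted variants discussed next:

def minkowski(p, q, t=2, weights=None):
    # t = 1: Manhattan (L1); t = 2: Euclidean (L2). Weights default to 1,
    # i.e., the unweighted metric.
    w = weights if weights is not None else [1.0] * len(p)
    return sum(wi * abs(pi - qi) ** t
               for wi, pi, qi in zip(w, p, q)) ** (1.0 / t)

def maximum_metric(p, q, weights=None):
    # L-infinity metric: the largest (weighted) coordinate difference.
    w = weights if weights is not None else [1.0] * len(p)
    return max(wi * abs(pi - qi) for wi, pi, qi in zip(w, p, q))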


If additional weights w0, w1, ..., wd−1 are assigned to the dimensions, then the weighted counterparts of these metrics can be defined, namely the weighted Manhattan, weighted Euclidean and weighted maximum metric. Now we will briefly explain the most common query types.

1.2.2 Similarity Query Types

Advanced database applications have been designed to support mainly two types of similarity queries, namely range queries and k-nearest neighbor queries.

Range queries are specified by a query object q and a range value ε, by which the answer set is defined to contain all the objects o from the dataset that have a distance to the query object q of less than or equal to ε:

range(q, ε) = {o ∈ P | dist(o, q) ≤ ε}

where q is the given query object and dist is the similarity measure applied. We note that dist is highly application-dependent and may be based on different metrics. An example of a similarity range query is the following: "find all objects in the database which are within a given distance ε from a given object". A similarity range query that is implemented via a rectangular (window) query needs to refine its answers based on the distance.

One major problem of range queries is the unpredictable size of the result set. The k-nearest neighbor (k-NN) query overcomes this problem by giving the user the possibility to specify the size k of the answer set. This query type does not require the user to provide a query range and is therefore far easier to use than the similarity range query. The k-nearest neighbor query returns the k most similar data points from the dataset and is defined as follows:

Let P be a dynamic dataset of d-dimensional points and q a query point. A k-NN query returns points o1, o2, ..., ok ∈ P such that for any point o ∈ P − {o1, o2, ..., ok}:

dist(q, oi) ≤ dist(q, o), ∀ 1 ≤ i ≤ k

An example of a k-nearest neighbor (k-NN) query is the following: "find the k most similar objects in the database which are closest in distance to a given object". Similarity range queries can be seen as a specialized form of k-NN queries, in the sense that the similarity range query has a fixed search sphere, while the k-NN query has to enlarge its search sphere until the k most similar objects are obtained. In terms of the search operation, k-NN is therefore more complicated.
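For illustration only, both query types can be written as linear scans in a few lines of Python (the index-based algorithms that avoid scanning the whole dataset are the subject of later chapters); dist is any of the metrics from Section 1.2.1:

import heapq

def range_query(P, q, eps, dist):
    # All objects within distance eps of the query object q.
    return [o for o in P if dist(o, q) <= eps]

def knn_query(P, q, k, dist):
    # The k objects with the smallest distance to q.
    return heapq.nsmallest(k, P, key=lambda o: dist(o, q))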

An incremental similarity search is achieved by the so-called similarity ranking query. The basic idea of this query type is to rank the database objects in order of their similarity distance. While incrementally proceeding in the ranking procedure, the next object should be reported shortly after the corresponding user request, as soon as its correct ranking is ensured. Another reason for deferring as much as possible of the ranking procedure is that the user may often be satisfied with only a few answers. Incremental ranking delivers the result elements of a k-NN query ordered by their distance. More formally:

dist(q, oi) ≤ dist(q, oi+1), 1 ≤ i ≤ k − 1

Note that k is not known in advance and that the query is stopped on user demand. The focus of our work primarily addresses the problem of incremental ranking (also known as incremental nearest neighbor search).
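A sequential sketch of incremental ranking in Python: a generator yields objects in increasing distance to q, and the consumer simply stops asking once it has seen enough (Chapter 3 develops index-based incremental algorithms that avoid ranking the whole dataset up front):

def incremental_ranking(P, q, dist):
    # Yield objects ordered by increasing distance to q; k need not be known.
    for o in sorted(P, key=lambda o: dist(o, q)):
        yield o

# Usage: pull results one at a time and stop on demand.
# ranking = incremental_ranking(P, q, dist)
# nearest = next(ranking)   # first result; ask for more only if needed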

There are other variations of similarity queries, e.g., Reverse Nearest Neighbor (RNN) queries [86]. In an RNN query the user only specifies a query point. Given this query point within a data set, a reverse nearest neighbor query finds all points for which the query point is their nearest neighbor. RNN queries arise in applications that need to determine the set of objects for which a given object is most interesting or relevant. RNN queries can be extended to RkNN queries. RNN queries generally require different data structure support and query processing strategies.

1.2.3 Query Processing

If the feature distance does not directly correspond to the object distance, but is only a lower bound of it, we speak of the paradigm of multi-step query processing. In a so-called filter step, a similarity query is processed on the feature space. As the feature distances are lower bounds of the actual object distances, the result of the similarity query is a set of candidates. It is guaranteed that each object satisfying the similarity query is contained in the candidate set (no false dismissals), but there may be candidates which are not actual answers to the similarity query. Therefore, the candidates have to be tested in the object space in a so-called refinement step. The paradigm yields advantages if only a few candidates have to be tested, i.e., if there is a good filter selectivity.

Fig. 1.2. Multi-step Query Processing.

Page 25: Advanced Indexing and Query Processing for Multidimensional Databases

1.3 Multi-criteria Decision Making in Database Systems 11

Figure 1.2 depicts the setting in multi-step query processing: the feature vectors are organized in an index. A query on this filter produces a set of candidates. The set is complete (no false dismissals), but it may contain several objects which are not actual hits for the query. Therefore, the exact object representation must be loaded into main memory. The final test of whether an object is an actual answer to the query is called the refinement step. From a database point of view, there are two main cost factors in this setting: the cost of the filter step is mainly influenced by the quality of the index, while the cost of refinement is mainly influenced by the filter selectivity, i.e., the size of the candidate set. The filter step, however, is identical for any application. Hence, it is desirable to support the filter step by the database management system.

This allows us to particularly focus on the following problem: given a set of d-dimensional points, how can we quickly search for points that fulfill a given query condition? The query condition could either be a multidimensional interval in which all points have to be located (window query), or it could be a point and we are looking for all points having a distance of less than some value ε from this point (range query), or we are looking for the nearest neighbor of this point (nearest neighbor query). All these query types are useful in advanced (non-standard) databases, and it depends on the specific application which one will be used.
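A minimal sketch of the filter-refinement paradigm for a range query, assuming feature_dist is a cheap lower bound of the expensive object_dist (both names are illustrative):

def multi_step_range(P, q, eps, feature_dist, object_dist):
    # Filter step: evaluate the cheap, lower-bounding feature distance.
    # Completeness: feature_dist(o, q) <= object_dist(o, q) for all o,
    # so no object satisfying the query is dismissed here.
    candidates = [o for o in P if feature_dist(o, q) <= eps]
    # Refinement step: evaluate the expensive exact distance only on
    # the candidates; a good filter selectivity keeps this set small.
    return [o for o in candidates if object_dist(o, q) <= eps]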

1.3 Multi-criteria Decision Making in Database Systems

This section introduces basic definitions of multi-criteria (or skyline) query processing in database systems and is organized in the following way: in Section 1.3.1 we formally define the skyline operator (or maximal vector problem), and in Section 1.3.2 we show how the SQL syntax can be extended to support skyline queries. Afterwards, in Section 1.3.3, we examine possible extensions of the skyline query.

1.3.1 The Skyline Operator

The skyline query can be traced back to the 1960s in the theory field, where the skyline is called the Pareto set, and the skyline objects are called admissible points [8] or maximal vectors [15]. The corresponding problem in the theory field is known as the maximal vector problem and was first discussed in [91], [106]. It describes the problem of finding the maxima of a set of vectors. In order to explain what the maximum of a set of vectors or points is, consider for example the points A(5,5) and B(2,2). If point A is greater in all dimensions than point B, then A is clearly the maximum of the set of points {A,B}. But what happens if A is greater than B in all dimensions except for one, e.g., A(5,2) and B(2,3)? Then a more sophisticated measure for comparing points has to be applied. This measure is called dominance and was defined in [106].

More formally, given a space S defined by a set of d dimensions {d1, d2, ..., dd} and a data set P on S, a point p ∈ P can be represented as p = {p1, p2, ..., pd}, where every pi is a value on dimension di. A point p ∈ P is said to dominate another point q ∈ P on space S, denoted as p ≺ q, if (1) on every dimension di ∈ S, pi ≤ qi; and (2) on at least one dimension dj ∈ S, pj < qj. The skyline of a space S is the set of points P′ ⊆ P which are not dominated by any other point on space S. That is, P′ = {p ∈ P | ∄q ∈ P : q ≺S p}. The points in P′ are called skyline points on space S.

The skyline of a d-dimensional dataset thus contains the points that are not dominated by any other point on all dimensions: a point dominates another point if it is as good or better in all dimensions and better in at least one dimension. Depending on the definition of better, one is looking for all maximum points (better means greater) or all minimum points (better means less). For simplicity, we assume that skylines are computed with respect to min conditions on all dimensions; however, all methods discussed can be applied with any combination of conditions.
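The dominance test and a naive quadratic-time skyline computation follow directly from this definition (min conditions on all dimensions); this Python sketch is for illustration and is not one of the algorithms developed later in this thesis:

def dominates(p, q):
    # p dominates q: p is as good everywhere and strictly better somewhere.
    return all(pi <= qi for pi, qi in zip(p, q)) and \
           any(pi < qi for pi, qi in zip(p, q))

def skyline(P):
    # A point is a skyline point iff no other point dominates it.
    return [p for p in P if not any(dominates(q, p) for q in P)]

For the used-car offers of Table 1.1, skyline([(30000, 50000), (8000, 30000), (10000, 40000)]) over (price, mileage) returns only (8000, 30000), the Ford, matching the discussion of User 1 in Section 1.1.2.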

To illustrate the above definition, consider for example a database containing information about cars. Each tuple of the database is represented as a point in a data space consisting of numerous dimensions. The x-dimension could represent the price of a car, whereas the y-dimension captures its mileage (shown in Figure 1.3). According to the dominance definition, a car dominates another car if it is cheaper and has run fewer miles. For this skyline query, a car C is in the answer set (i.e., the skyline) if there does not exist any car that dominates C, i.e., that is both cheaper and has a lower mileage reading than C. The user can then trade off price against mileage reading among the points in this answer set (called skyline points).

Fig. 1.3. Example dataset and Skyline (Price on the x-axis, Mileage reading on the y-axis).

If the user from the example in the preceding paragraph cared not just about price and mileage reading, but also about the horsepower, the age and the fuel consumption, then most cars may have to be included in the skyline answer set, since for each car there may be no single car that beats it on all criteria, even if it beats it on many.

1.3.2 Skyline Queries - SQL Extension

In [26] an extension to SQL is introduced. The SQL SELECT statement is extended by an optional SKYLINE OF clause describing the attributes (dimensions) that should be considered for the skyline computation:

SELECT ... FROM ... WHERE ...
GROUP BY ... HAVING ...
SKYLINE OF [DISTINCT] d1 [MIN|MAX|DIFF], ..., dd [MIN|MAX|DIFF]
ORDER BY ...;

The SKYLINE OF clause describes for each dimension whether this particular dimension should be minimized (MIN), maximized (MAX), or simply be different (DIFF). The optional DISTINCT specifies how to deal with duplicates. The SKYLINE OF clause is computed after the SELECT, FROM, WHERE, GROUP BY, and HAVING parts of the SQL statement, but before the ORDER BY part. Point p dominates point q for a skyline query having the following SKYLINE OF clause:

SKYLINE OF d1 MIN, ..., dk MIN, dk+1 MAX, ..., dl MAX, dl+1 DIFF, ...,dm DIFF

if the following three conditions hold:

• pi ≤ qi for all i = 1, ..., k• pi ≥ qi for all i = (k + 1), ..., l• pi = qi for all i = (l + 1), ..., m

Example. The following SQL query retrieves the skyline of the car database discussed above, i.e., cheap BMW cars with low mileage:

SELECT *
FROM Cars
WHERE make = 'BMW'
SKYLINE OF price MIN, mileage MIN;
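The three dominance conditions can be checked generically by annotating each dimension; the following Python sketch is our own illustration of the semantics above, not code from [26]. Strictness in at least one MIN or MAX dimension additionally rules out p = q, in line with the dominance definition of Section 1.3.1:

def dominates_annotated(p, q, anns):
    # anns[i] is 'MIN', 'MAX', or 'DIFF' for dimension i.
    ok = all(pi <= qi if a == 'MIN' else
             pi >= qi if a == 'MAX' else
             pi == qi
             for pi, qi, a in zip(p, q, anns))
    strict = any(pi < qi if a == 'MIN' else
                 pi > qi if a == 'MAX' else
                 False
                 for pi, qi, a in zip(p, q, anns))
    return ok and strict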

1.3.3 Variations of Skyline Queries

Assume a database containing points in a d-dimensional space with axes d1, d2, ..., dd. A Dynamic Skyline Query (DSQ) [102] specifies m dimension functions f1, f2, ..., fm, such that each function fi (1 ≤ i ≤ m) takes as parameters the coordinates of the data points along a subset of the d axes. The goal is to return the skyline in the new data space with dimensions defined by f1, f2, ..., fm. Given a set of constraints, a Constrained Skyline Query (CSQ) [102] returns the most interesting points in the data space defined by the constraints. Typically, each constraint is expressed as a range along a dimension, and the conjunction of all constraints forms a hyper-rectangle (referred to as the constraint region) in the d-dimensional attribute space.

A very interesting generalization of the skyline concept is the following: given a set of objects with d dimensions, a skyline query can be issued on any subset of k (k < d) of the d dimensions. The subset of interest is called a subspace, and the corresponding skyline query is called a Subspace Skyline Query [105], [131].

Similar to k-NN queries (which return the k nearest neighbors of a point), a k-Skyband Query [102] reports the set of points which are dominated by at most k points. Conceptually, k represents the thickness of the skyline; the case k = 0 corresponds to a conventional skyline. Other variations of skyline queries include Ranked (top-k) Skyline Queries [102], Group-by Skyline Queries [102], Thick Skyline Queries [78], Relaxed Skyline Queries [71], Enumerating Skyline Queries [102], Spatial Skyline Queries [112] and k-dominant Skyline Queries [31].
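For instance, a naive k-skyband computation requires only a small change to the skyline sketch of Section 1.3.1: instead of rejecting a point on its first dominator, count the dominators and keep the points with at most k of them:

def k_skyband(P, k):
    # Points dominated by at most k others; k = 0 yields the usual skyline.
    # Reuses dominates() from the sketch in Section 1.3.1.
    return [p for p in P if sum(dominates(q, p) for q in P) <= k]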

1.4 Advanced Database Systems: OptimizationStrategies for Similarity and Skyline Queries

As demonstrated in Section 1.1, advanced database and information systems contain enormous amounts of data. In addition to this huge amount of data, the complexity of the data objects increases as well. Multimedia applications store huge amounts of complex data consisting of images, audio and video. The analysis of large collections of complex objects yields several new challenges for similarity search and multi-criteria decision making. The challenges for efficient query processing over multi-dimensional database systems are manifold, including topics like the 'curse of dimensionality' or conflicting criteria. This leads to the problem that the accumulated data have to be managed in such a way that a user can easily retrieve the desired information under conflicting criteria from this huge amount of data.

In this thesis, we concentrate on four topics which have recently come to play a major role in many application domains. These topics are the need for efficient index structures to cope with high-dimensional data, the re-usability of existing multidimensional index structures which work well in low-dimensional spaces, novel (more expressive) query operators for advanced database systems, and the efficient analysis of complex high-dimensional data. To summarize, this thesis addresses two important problems: similarity search as well as multi-criteria optimization in multidimensional databases. The proposed solutions work fully content-based and do not rely on additional information.



• High-dimensionality nature of data: The increased need for efficient index structures which effectively support high-dimensional data is of major importance. Furthermore, these indexes should be dimension-adaptive, which means that subspaces of interest are automatically identified and indexes are built on these subspaces. Moreover, the developed techniques should adapt to the underlying query workload and to the specific user distance function.

• Re-usability of existing index structures: Many indexes have been proposed to manage high dimensional data. Unfortunately, the majority of the indexes cannot be readily integrated into existing DBMSs. As a matter of fact, it is not easy to introduce a new index into an existing commercial data server because other components (e.g., the buffer manager, the concurrency control manager) will be affected. Hence, we propose to build our index on top of already existing low dimensional R-trees, not suffering from the 'curse of dimensionality', in order to exploit well-tested concurrency control techniques and relevant processing strategies.

• Novel query operators for advanced database systems: There is a need for more expressive types of skyline queries, which generalize well-known types of queries on a given dataset, such as constrained and subspace skylines. In addition, careful analysis of the properties and characteristics of dynamic types of skyline queries is of major importance to the applications mentioned above. Finally, novel query operators for the analysis of these advanced database systems are needed, which generalize known query types, such as reverse nearest neighbor queries.

• Analysis of complex high-dimensional data: Specialized techniques are required to manage objects in modern advanced databases efficiently and effectively. Various problems and solutions related to efficient analysis in advanced database systems are discussed, and new techniques are needed for speeding up the analysis of various types of multimedia data. Moreover, efficient query processing and analysis in decision support systems has been gaining much interest lately.

1.5 Outline of the Thesis

The major goal of this thesis is the development of novel techniques for similarity search and multi-criteria optimization to cope with the challenges of advanced database systems as elaborated in Section 1.4. The ideas and concepts presented in the different chapters of this thesis are major extensions of material that has been partially published, see [40], [41], and [39].

The major contributions of this thesis include:



Similarity Query Processing and Optimization:

• An efficient and effective framework for similarity query processing in high dimensional data spaces.

• A novel cost model for determining the number of low-dimensional indexes for similarity queries. This model is based on the fractal dimension of the dataset.

• Different algorithms for assigning the dimensions to the indexes, where each dimension is assigned to a lower-dimensional subspace, such that the cost of nearest neighbor queries is minimized.

• Incremental algorithms for merging the results received from the underlying indexes in order to support k-NN queries efficiently.

• Extensions of our framework, such as NN search using external priority queues.

• A new dimension-adaptive indexing method for efficient similarity search in high dimensional data spaces. This method is a multidimensional extension of the original iDistance technique.

• A new mapping technique that tries to retain better distance information through a multi-dimensional transformation.

• The dimension assignment for the multidimensional iDistance is flexible and adapts to the data distribution.

• Efficient range and k nearest neighbor query processing algorithms.
• A cost model that defines the appropriate dimensionality of the indexes.

Skyline Query Processing and Optimization:

• Novel types of queries called Constrained Subspace Skyline Queries (CSSQ) and Reverse Skyline Queries (RSQ) are proposed.

• A progressive algorithm for constrained subspace skyline computation for the case where the data space is vertically distributed over multiple indexes.

• Several pruning strategies to identify dominated regions and to discard irrelevant sub-trees of the indexes.

• We effectively solve the problem of using multiple points for discarding vertically distributed data points during skyline computation.

• A workload-adaptive strategy for determining the number of indexes and the assignment of dimensions to the indexes is presented.

• The reverse skyline is introduced, which captures the notion of the influence of a data point with respect to dynamically dominated points.

• We present two algorithms: the BBRS algorithm, which is an improved customization of the original BBS algorithm, and the enhanced RSSA algorithm, which is based on accurate pre-computed approximations of the skylines.

• An efficient approximation scheme is proposed to approximate the dynamic skyline of an arbitrary point.



• An optimal algorithm for the skyline approximation is provided for two-dimensional data.

• A greedy algorithm for computing accurate skyline approximations for three- and higher-dimensional spaces.

In this chapter, we presented some of the challenges of modern database systems. Those challenges include the need for efficient index structures to cope with high dimensional data, the re-usability of existing multidimensional index structures which work well in low-dimensional spaces, novel query operators for advanced database systems and efficient analysis of complex high-dimensional data. We demonstrated that there is a need for efficient and effective similarity search methods in large databases of high-dimensional data. The aim of this thesis is to improve the efficiency of known similarity search methods and to provide new approaches to solve the efficiency and effectiveness problems of the existing methods. In addition, the expressiveness of known query types should be enhanced and new techniques should be developed. The remainder of this thesis is organized as follows.

The following chapter of Part I provides a brief and rather general overview of existing methods for similarity search and techniques for skyline query processing in multidimensional databases.

Part II describes novel similarity search techniques, dealing with the two major challenges elaborated in Section 1.4.

Chapter 3 presents a new approach to indexing multidimensional data that is particularly suitable for the efficient incremental processing of nearest neighbor queries. The basic idea is to use index-striping that vertically splits the data space into multiple low- and medium-dimensional data spaces. The data from each of these lower-dimensional subspaces is organized by using a standard multi-dimensional index structure. In order to perform incremental NN-queries on top of index-striping efficiently, we first develop an algorithm for merging the results received from the underlying indexes. Then, an accurate cost model relying on a power law is presented that determines an appropriate number of indexes. Moreover, we consider the problem of dimension assignment, where each dimension is assigned to a lower-dimensional subspace, such that the cost of nearest neighbor queries is minimized.

We also report results of a performance study that evaluates the proposed scheme against dimensionality reduction techniques.

In Chapter 4, we present a generalization of the iDistance technique, called multi-dimensional iDistance (MiD), for k nearest neighbor query processing. Three main steps are performed for building MiD. In agreement with iDistance, firstly, data points are partitioned into clusters and, secondly, a reference point is determined for every cluster. However, the third step substantially differs from iDistance as a data object is mapped to an m-dimensional distance vector where m > 1 generally holds. The m dimensions are generated by splitting the original data space into m subspaces and computing the partial distance between the object and the reference point for every subspace. The resulting m-dimensional points can be indexed by an arbitrary point access method like an R-tree. Our crucial parameter m is derived from a cost model that is based on a power law. We present range and k-NN query processing algorithms for MiD.

Compared with the original iDistance technique, both theoretical analysis and experimental studies have shown that MiD is superior to iDistance for similarity queries.

Part III proposes novel query operators and optimization techniques for skyline query processing and analysis. These techniques take advantage of existing low-dimensional index structures and use lower and upper skyline bounds in order to identify true hits and filter out true drops.

Chapter 5 first introduces the problem of Constrained Subspace Skyline Queries. We present a CSS search algorithm which builds on multiple low-dimensional index structures. Due to the use of well performing low dimensional index structures, it can also answer constrained subspace skyline queries for arbitrarily large subspaces. In order to support constrained skylines for arbitrary subspaces, we present approaches exploiting multiple low-dimensional indexes instead of relying on a single high-dimensional index. Effective pruning strategies are applied to discard points from dominated regions. An important ingredient of our approach is the workload-adaptive strategy for determining the number of indexes and the assignment of dimensions to the indexes.

An extensive performance evaluation shows the superiority of our proposed technique compared to the original BBS algorithm.

Chapter 6 introduces the concept of Reverse Skyline Queries (RSQ). Given a set of data points P and a query point q, an RSQ returns the data objects that have the query object in their 'dynamic' skyline. Such a dynamic skyline corresponds to the skyline of a transformed data space where point q becomes the origin and all points are represented by their distance to q. In order to compute the reverse skyline of an arbitrary query point, we first propose a branch-and-bound algorithm (called BBRS), which is an improved customization of the original BBS algorithm. To further reduce the computational cost of determining whether a point belongs to the reverse skyline, we propose an enhanced algorithm (called RSSA) that is based on accurate pre-computed approximations of the skylines. These approximations are used to identify whether a point belongs to the reverse skyline or not.

We show in a broad experimental evaluation on real-world and synthetic datasets that our algorithms can efficiently support reverse skyline queries.

Part IV concludes this thesis with a short summary.

Chapter 7 recapitulates and discusses the major contributions of the thesis. In addition, Chapter 8 indicates some ideas for possible future work in the areas of similarity search and multi-criteria optimization for advanced database systems.


2

Query Processing in Multidimensional Spaces

This chapter is dedicated to the foundations of query processing in multidimensional spaces, with a strong emphasis on related work. We present important concepts and previous work on similarity search and multi-criteria decision making, aiming at supporting efficient query processing in multidimensional spaces.

During the last decades, an increasing number of applications that require efficient query processing have emerged in which a large amount of high dimensional data points have to be processed. Instead of exact match queries, these applications require efficient support for similarity queries. Among these similarity queries, the k-nearest neighbor query (k-NN query), which delivers the k nearest points to a query point, is of particular importance for the applications mentioned above. We review existing multi-dimensional indexes that form the basis of new indexes designed for high-dimensional spaces. Common principles of the well-known index structures for high-dimensional data spaces are introduced.

The skyline operator and its computation have attracted much attention recently. This is mainly due to the importance of various skyline results in applications involving multi-criteria decision making. Given a set of d-dimensional objects, the operator returns all objects that are non-dominated. The set of skyline points presents a scale-free choice of data points worthy of further consideration in many contexts. We discuss related work that addresses skyline query processing in multidimensional spaces and we review algorithms for skyline computation proposed in the literature.

The remainder of this chapter is organized as follows. In Section 2.1, we discuss similarity query processing in high dimensional data spaces. In Section 2.2, we review related work on skyline query processing. Finally, a discussion is presented in Section 2.3.



2.1 Similarity Query Processing

There are several applications that require high dimensional query processing such as similarity search. Similarity search has gained increasing importance in many different applications, including medical imaging [88], molecular biology [3], multimedia [60], and many others. The search for similar database objects for a given query object is typically performed by following a feature-based approach. The basic idea is to extract important properties from the original data objects and to map these features into high-dimensional feature vectors, i.e. points in the feature space. Since the choice of which features to extract mainly depends on the considered application, numerous feature transformations have been proposed. However, the resulting feature space usually consists of numerous dimensions.

2.1.1 Multi-Step Query Processing

Most indexing and processing techniques follow filter-and-refine principles. These techniques consist of two steps: the filter step and the refinement step. Some may iterate over these two steps. In the filter step, these techniques filter out a subset of data objects that are obviously of no interest for the answers. The aim is to reduce the number of candidates, and hence further processing cost, while ensuring that the answers are still in the candidate set. In the refinement step, the smaller candidate set is checked in more detail (or refined) based on more complete information to return the answers to the query.

The basic idea is to use an easy-to-compute or simple distance function that puts a lower bound on the actual distance, and to filter out in the first step objects that are of no interest. The objects returned by the filter step are then evaluated using the original distance function.

Filter steps are based on approximated objects. For efficiency reasons, these filter steps might be supported by suitable index structures. Refinement steps discard false positive candidates, i.e. false hits, but are not able to reconstruct false negatives, i.e. false drops, that have been dismissed by a previous filter step. Thus, the strong requirement for filter steps is to prevent false drops, but the quality of a filter step depends on its selectivity. In general, the evaluation of a single object within a refinement step is expensive, and the number of candidates should be as small as possible. Thus, the fewer candidates a filter step passes to a subsequent refinement step, the better the performance of the overall query processing is. The following property ensures that no false drops are generated:

Lower-Bounding Property. If a filter distance df lower-bounds the object distance do, no false drops are produced by a multi-step query processor. Let O be a set of objects. A filter distance function df and an object distance function do fulfill the lower-bounding property if df underestimates do in any case, i.e. for all objects it holds that:



∀ o1, o2 ∈ O : df(o1, o2) ≤ do(o1, o2)

Consequently, the lower-bounding property ensures that no false drops occur. Similarity queries are processed in the following way:

Multi-Step Range Queries. In [48], a multi-step algorithm is presented that performs similarity search for complex distance functions by using appropriate filter distance functions df on the approximation Appr(o) of the objects o. An approximation Appr(o) might be a k-dimensional feature vector Fk(o) managed in an R-tree, or a one-dimensional value F1(o) managed within a B+-tree. The corresponding lower-bounding property guarantees that no false drops occur when applying the filter step.

Fig. 2.1. Multi-step Range Query Processing [109]

The efficiency of the method depends on the selectivity of the filter distance function and on the underlying access method. Figure 2.1 provides a schematic illustration of the algorithm.
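
The filter-and-refine principle for range queries can be condensed into a few lines. The following Python sketch is our own simplified illustration: the linear scan in the filter step stands in for the index-supported scan over approximations, and d_filter is assumed to lower-bound d_object.

def multistep_range_query(objects, q, eps, d_filter, d_object):
    # Filter step: the cheap, lower-bounding distance discards objects
    # that cannot possibly qualify; no false drops are produced.
    candidates = [o for o in objects if d_filter(o, q) <= eps]
    # Refinement step: the expensive object distance removes false hits.
    return [o for o in candidates if d_object(o, q) <= eps]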

Multi-Step k-NN Queries. In the context of fast nearest neighbor search in medical image databases, Korn et al. [88] suggest an algorithm for k-nearest neighbor search that follows the multi-step query processing paradigm. In practice, it often occurs that similarity distance functions have a higher complexity and may not be represented by a simple feature vector distance, or that they are too high in their dimensionality to be efficiently managed by a multidimensional index structure.

In Figure 2.2 we illustrate the architecture of the algorithm and indicate the interplay with the underlying index structure managing the approximated objects.

Fig. 2.2. Multi-step k-NN Query Processing [109]

Multi-Step Ranking Queries. In [109] Seidl and Kriegel formally introduced a criterion for the optimality of multi-step k-nearest neighbor algorithms (called r-optimality) with respect to the number of candidates for which an exact evaluation of the object distance has to be performed. Furthermore, they presented an optimal multi-step k nearest neighbor algorithm. The algorithm is depicted schematically in Figure 2.3.

Fig. 2.3. Multi-step Incremental NN Query Processing [109]

The algorithm has two basic components: By the incremental ranking query on the underlying access method, candidates are iteratively generated in ascending order according to their filter distance to the query object. The second major component is the result list that manages the k nearest neighbors of the query object q within the current candidate set, keeping step with the candidate generation. The current k-th distance is held in dmax, which is set to infinity until the first k candidates are retrieved from the index and evaluated. During the algorithm, dmax is decreased exactly down to dist(q, k), the distance between q and its k-th nearest neighbor. Based on this fact, the termination of the algorithm is controlled.
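
The interplay between the incremental ranking and the result list can be sketched compactly. In the following Python sketch (our own simplified rendering with hypothetical names), the sorted list simulates the incremental ranking that the underlying index would produce lazily; d_filter must lower-bound d_object.

def multistep_knn(objects, q, k, d_filter, d_object):
    # Candidates arrive in ascending filter distance (incremental ranking).
    ranking = sorted(objects, key=lambda o: d_filter(o, q))
    result = []          # (object distance, object), the k best so far
    dmax = float('inf')  # current k-th object distance
    for o in ranking:
        if d_filter(o, q) > dmax:
            break        # the lower bound already exceeds dmax: stop
        result.append((d_object(o, q), o))
        result.sort(key=lambda t: t[0])
        del result[k:]   # keep only the k closest candidates
        if len(result) == k:
            dmax = result[-1][0]
    return result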



2.1.2 Multi-dimensional Index Structures

In this section, we outline some of the index structures suitable for high dimensional similarity query processing. For a more detailed elaboration on multidimensional access methods and on the corresponding query processing techniques, we refer the interested reader to [52] and [25]. Typically, query processing is a CPU- and I/O-intensive task, and the conventional approach to address this problem is to use a multidimensional index structure. The R*-tree [13], for example, is an index structure for multidimensional data objects which hierarchically partitions the data space into sub-partitions. The concept of minimum bounding rectangles (MBRs) is used to conservatively approximate objects that lie within a sub-partition. Since the R*-tree is mainly efficient for low-dimensional data spaces (d < 10, where d denotes the dimensionality), specialized index structures have been proposed to deal with high-dimensional data spaces. However, when the dimensionality of the data space is very high, even these specialized index structures mostly fail to process similarity queries efficiently. This effect is usually termed the curse of dimensionality. Note that the degradation of the performance of the index structure is influenced mostly by the intrinsic dimensionality (fractal dimensionality).

The R-tree

The R-tree [62] is a multi-dimensional generalization of the B-tree [12] that preserves height-balance. As in the case of the B-tree, node splitting and merging are required for insertion and deletion of objects. The R-tree has received a great deal of attention owing to its well-defined structure and the fact that it is one of the earliest proposed tree structures for indexing spatial objects of non-zero size. Many papers have used the R-tree to benchmark the performance of their structures.

An entry in a leaf node of the R-tree consists of an object-identifier of the data object and a k-dimensional bounding box which bounds its data objects. In a non-leaf node, an entry contains a child-pointer pointing to a lower level node in the R-tree and a bounding box covering all the boxes in the lower level nodes in the subtree. Figure 2.4 illustrates the planar representation and the structure of an R-tree.

Note that for the search algorithm, the decision to visit a sub-tree depends on whether the covering box intersects the query box. In practice, it is quite common for several covering boxes in an internal node to intersect the query box, resulting in the traversal of several sub-trees. The problem is further made worse by the fact that a small query in high-dimensional databases has a relatively large query width along each dimension. Therefore, the minimization of overlaps of the covering boxes as well as the coverage of these boxes is of primary importance in constructing the R-tree. The overlap between bounding boxes becomes more severe as the dimensionality increases, and it is precisely due to this weakness that the R-tree does not perform well for high-dimensional databases.

Fig. 2.4. An R-tree example: (a) planar representation; (b) R-tree nodes.

For the following examples we use the R-tree of Figure 2.4, which indexes a set of points p1, p2, ..., p8, assuming a capacity of two entries per node. Points that are close in space (e.g., p1 and p2) are clustered in the same leaf node (N1), represented as a minimum bounding rectangle (MBR). Nodes are then recursively grouped together following the same principle until the top level, which consists of a single root. R-trees (like most spatial access methods) were motivated by the need to efficiently process range queries, where the range usually corresponds to a rectangular window or a circular area around a query point. In addition, nearest neighbor queries are of major importance.

The R-tree answers the range query q (shaded area) in Figure 2.4 as follows. The root is first retrieved and the entries (e.g., N5, N6) that intersect the range are recursively searched because they may contain qualifying points. Non-intersecting entries (e.g., N3) are skipped. Notice that for non-point data (e.g., lines, polygons), the R-tree provides just a filter step to prune non-qualifying objects. The output of this phase has to pass through a refinement step that examines the actual object representation to determine the actual result. The concept of filter and refinement steps applies to all spatial queries on non-point objects.

The R-tree nearest neighbor (NN) algorithm proposed in [68] keeps a heap with the entries of the nodes visited so far. Initially, the heap contains the entries of the root sorted according to their minimum distance (mindist) from q. The entry with the minimum mindist in the heap (N6) is expanded, i.e., it is removed from the heap and its children (N1, N2) are added together with their mindist. The next entry visited is N1 (its mindist is currently the minimum in the heap), where the actual 1-NN result (p2) is found. The algorithm terminates, because the mindist of all entries in the heap is greater than the distance of p2. The algorithm can be easily extended for the retrieval of k nearest neighbors (k-NN). Furthermore, it is optimal (it visits only the nodes necessary for obtaining the nearest neighbors) and incremental, i.e., it reports neighbors in ascending order of their distance to the query point, and can be applied when the number k of nearest neighbors to be retrieved is not known in advance.
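
The best-first strategy is short enough to sketch in full. The following Python code is a simplified illustration assuming a hypothetical node representation (dicts with an 'mbr' and 'children' for internal nodes, and a 'point' for data entries); it yields neighbors incrementally in ascending distance.

import heapq, itertools, math

def mindist(mbr, q):
    # Minimum distance from q to a box given as ((lo, hi), ...) per dimension.
    return math.sqrt(sum(max(lo - x, 0, x - hi) ** 2
                         for x, (lo, hi) in zip(q, mbr)))

def incremental_nn(root, q):
    counter = itertools.count()          # tie-breaker for equal distances
    heap = [(0.0, next(counter), root)]  # priority queue ordered by distance
    while heap:
        dist, _, entry = heapq.heappop(heap)
        if 'point' in entry:
            yield dist, entry['point']   # reported in ascending distance
        else:
            for child in entry['children']:
                d = (math.dist(q, child['point']) if 'point' in child
                     else mindist(child['mbr'], q))
                heapq.heappush(heap, (d, next(counter), child))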

The R-tree is the fundament of a whole family (R-tree, R+-tree, and R*-tree) of height-balanced, multidimensional index structures. Minimization of both coverage and overlap is crucial to the performance of the R-tree. A balancing act must be performed such that a near optimum of both these minimizations can produce the best result. Beckmann et al. [13] proposed the R*-tree, which introduced an additional optimization objective concerning the margin of the covering boxes: squarish covering boxes are preferred. Since clustering objects with little variance in the lengths of the edges tends to reduce the area of the cluster's covering box, the criterion that ensures squarish covering boxes is used in the insertion and splitting algorithms. Intuitively, squarish covering boxes are compact, thereby facilitating packing and reducing overlap.

Dynamic hierarchical spatial indexes are sensitive to the order of the insertion of data. A tree may behave differently for the same data set if the sequence of insertions is different. Data objects inserted previously may result in a bad split in the R-tree after some insertions have been made. Hence it may be worthwhile to do some local re-organization, even though such local re-organization is expensive. The R*-tree deletion algorithm provides re-organization of the tree to some extent, by forcing the entries in underflowed nodes to be re-inserted from the root. Performance studies have demonstrated that deletion and re-insertion can improve the performance of the R*-tree significantly [13]. The re-insertion increases the storage utilization, but it can be expensive when the tree is large.

The X-tree

The X-tree, eXtended node tree [21], is an extension of the R-tree which employs a more sophisticated split algorithm and enlarged internal nodes (known as super-nodes) in order to reduce the effect of the 'dimensionality curse'. Instead of a split in the event of a node overflowing due to an additional entry, the notion of the super-node is introduced to reduce high overlap between nodes. The main objective of the insertion algorithm is to avoid splits that would result in highly overlapping normal-sized nodes. The decision concerning when not to split depends on the pre-defined overlap threshold. The R-tree deletion and search algorithms can be directly applied to the X-tree.

In an experimental study with artificial and real-world data sets, the authors showed that for high-dimensional data, the X-tree clearly outperforms the R*-tree [21].



The SS-tree

The SS-tree [125] has a structure similar to that of the R-tree. However, instead of covering boxes, its internal node entries contain a sphere described by (centroid, radius) that defines the data space of its child nodes. The centroid is the mean value of the feature vectors in the child node, and the radius is the distance from the centroid to the furthest feature vector. While spheres generally lead to a smaller access probability of pages compared to volume-equivalent MBRs, they have the disadvantage that overlap-free splits are often not possible. The search and update algorithms of the R-tree can be applied with minor modifications for handling spheres instead of bounding boxes. Unlike the R-tree and many of its variants, the SS-tree has been designed to handle similarity search.

The advantages of using spheres are three-fold. First, compared to the 2d coordinates required for describing a bounding box, a sphere requires only d + 1 coordinates or attributes. With less space being utilized for information regarding data subspaces, more entries can be stored in a node, and hence the SS-tree provides an effect similar to that of the X-tree [21]. Second, the longest distance from the centroid of a sphere to any point inside it is bounded by the radius. It is a constant, insensitive to any increase in dimensionality. On the contrary, the longest distance between two points in a rectangular box is the length of the diagonal, which increases along with any increase in dimensionality. Third, the use of spheres is suitable for similarity search in terms of computation and checking.

Experiments conducted in [125] indicate that the SS-tree accrues larger bounding volumes than those of the bounding boxes in the equivalent R-trees. A larger volume incurs additional search cost, and to provide efficient range search, the SR-tree [83] was proposed to maintain both spherical and rectangular boxes in the internal node entry.

2.1.3 Transformation into Low Dimensional Space

The techniques presented in this subsection reduce the original high dimensional space to a new lower dimensional space and make use of already existing index structures, such as the R-tree, which have a good performance in low dimensional spaces. The idea is to pick the most important features to represent the data, and an index is built on the reduced space. To answer a query, it is mapped to the reduced space and the index is searched based on the dimensions indexed. The answer set returned contains all the answers as well as false positives.

Global Dimensionality Reduction

Principal Component Analysis (PCA) [79] is a widely used method for transforming points in the original (high-dimensional) space into another (usually lower dimensional) space. It examines the variance structure in the database and determines the directions along which the data exhibits high variance. The first principal component (or dimension) accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. Using PCA, most of the information in the original space is condensed into a few dimensions along which the variances in the data distribution are the largest.
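
The transformation itself is a few lines of linear algebra. The following NumPy sketch (our own illustration) computes the top-k principal components from the covariance matrix and projects the data onto them.

import numpy as np

def pca_reduce(X, k):
    # X: n points as rows, d dimensions as columns.
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    components = eigvecs[:, ::-1][:, :k]    # directions of largest variance
    return Xc @ components                  # n x k reduced representation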

The TV-tree (Telescopic-Vector tree) [93] was the first to address the problem of high-dimensional indexing in the context of image and time-series databases. The TV-tree is an R-tree-like index structure and is designed especially for real data that are amenable to the Karhunen-Loeve transform, i.e. principal component analysis. Such data yield a high variance and therefore a good selectivity in the first few dimensions. The idea is to contract and extend the feature vectors dynamically to discriminate among the objects. That is, for nodes higher up in the R-tree-like structure, fewer features are used to partition the data space. As the tree is descended, more features are used to define the minimum bounding region. The benefit of this tree lies in its ability to adapt dynamically and use a variable number of dimensions to distinguish between objects or groups of objects. Since this number of required dimensions is usually small, the method saves space and leads to a larger fan-out, which reduces the effect of the 'dimensionality curse'. As a result, the tree is more compact and shallower. However, features have to be selected based on some pre-defined importance or ordering.

The authors compared the TV-tree with the R*-tree and showed that their method saves up to 80% in disk accesses [93].

Local Dimensionality Reduction

When the data set is globally correlated, principal component analysis is an effective method for reducing the number of dimensions with little or no loss of information. However, in practice, the data points tend not to be globally correlated, and the use of global dimensionality reduction may cause a significant loss of information. As an attempt to reduce such loss of information and also to reduce query processing cost due to false positives, a Local Dimensionality Reduction (LDR) technique was proposed in [28]. It exploits local correlations in data points for the purpose of indexing.

Clustering algorithms have been used to discover patterns in low-dimensional space [2], where data points are clustered based on correlation and interesting patterns are then discovered. In [28], a clustering algorithm is first used to discover correlated clusters in the data set. Next, the local dimensionality reduction (LDR) technique involving local correlation analysis is used to perform dimensionality reduction of each cluster. For each cluster, a multi-dimensional index is built on the reduced dimensional representation of the data points, and a global index is maintained on top of these individual cluster indexes. The data points that do not belong to any cluster are considered as outliers, and they may not be indexed depending on how big the original dimensionality d is. The LDR technique allows the user to determine the amount of information loss incurred by the dimensionality reduction, the query precision and hence the cost.

The sensitivity of the LDR technique is affected by the degree of correlation, the number of clusters and the cluster size. It is shown in [28] that the LDR technique is better than the general global dimensionality reduction technique (GDR) in terms of precision as well as effectiveness.

In general, dimensionality reduction can be performed on the data sets before they are indexed as a means to reduce the effect of the dimensionality curse on the index structure. Dimensionality reduction is lossy in nature, and hence query accuracy is affected as a result. Further, as more data points are inserted into the database, the correlation between data points may change, and this further affects the query precision.

The Fastmap

Fastmap [47] transforms a matrix of pairwise distances into a set of low-dimensional points, providing the basis for the construction of effective indexes using existing multi-dimensional indexing methods. That is, it is intended for mapping objects into points in a k-dimensional space so that distances between objects are well-preserved, without requiring a domain expert to provide a set of k feature-extraction functions. However, Fastmap assumes that a static data set is given [47].

The authors compare their method with the traditional Multi-Dimensional Scaling (MDS) technique on real and synthetic datasets and show that Fastmap achieves significant time savings over MDS, even for small datasets [47].

The ’dimensionally-reduced’ R-tree

Instead of the signal processing techniques used in TV-trees and Fastmap, [57] makes use of a space-filling curve to achieve the goal of dimensionality reduction. Several examples of space-filling curves have been proposed in the literature: the z-ordering [99], the Gray code [46] and the Hilbert curve [75]. Space-filling curves provide the means for introducing a locality-preserving total ordering on points in a multi-dimensional space. This property has been exploited in many indexing structures and is widely recognized as a general technique for reducing a multi-attribute indexing problem to a classical single-attribute indexing problem.

The choice of a particular space-filling curve for reducing the dimensionality of the data affects only how data points in a d-dimensional space are mapped to corresponding coordinates in a k-dimensional space. This determines how a query box is decomposed into query boxes of lower dimensionality, but not fundamentally how searches and update operations are carried out. In [57], the Hilbert space-filling curve is adopted for the purpose of transformation, based on the fact that the mapping based on the Hilbert curve outperforms the others under most circumstances [75].
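
The Hilbert mapping itself is somewhat intricate; the z-ordering [99] mentioned above illustrates the underlying bit-interleaving idea in a few lines. The following Python sketch is our own illustration, mapping a point with integer coordinates to its one-dimensional z-order (Morton) key.

def z_order(coords, bits):
    # Interleave the bits of the coordinates: one bit per dimension for
    # every bit position, yielding a roughly locality-preserving key.
    key, d = 0, len(coords)
    for b in range(bits):
        for i, c in enumerate(coords):
            key |= ((c >> b) & 1) << (b * d + i)
    return key

print(z_order((3, 5), 3))  # interleaves 011 and 101 into the key 39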

Search operations on the 'dimensionally-reduced' R-tree are somewhat different from those of its conventional counterpart because it is no longer possible to search the d-dimensional space directly; instead, the d-dimensional subspace defined in a range query must now be transformed into a collection of k-dimensional search regions. The search algorithm is a generalized form of its counterpart for regular 'native' R-trees; instead of searching only a single region, the search algorithm accepts as input a disjoint set of search regions. When traversing a dimensionally-reduced R-tree, a sub-tree will be explored only if it encompasses a region which intersects at least one region in the input collection of search regions. This guarantees that each node in the tree will only be visited once, although it remains necessary for multiple paths to be traversed.

Experimental studies in [57] have indicated that the technique is capable of providing significant performance gains at high dimensions. However, the computation cost, which is incurred by the decomposition of a region into many value ranges, can be rather high.

The Tree-Striping Technique

In [19] a technique called tree striping was introduced. It generalizes the well-known inverted lists and multidimensional indexing approaches. A theoretical analysis shows that both inverted lists and multidimensional indexing approaches are far from optimal. Consequently, the tree-striping approach proposes the use of a number k of lower-dimensional indexes. Based on a uniform cost model, the authors first present a formula for the optimal number of multi-dimensional indexes. Two techniques are then presented for assigning dimensions to indexes, as sketched below. The first one simply follows a round-robin strategy, whereas the other exploits the selectivity of the different dimensions such that the dimension with the highest selectivity is assigned to the first index and so on.
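
Both assignment strategies are simple to state in code. The following Python sketch is a hypothetical illustration of the two ideas; the exact assignment rules in [19] may differ in detail.

def round_robin_assignment(d, k):
    # Deal the d dimensions out to k indexes in turn.
    return [list(range(i, d, k)) for i in range(k)]

def selectivity_assignment(selectivities, k):
    # Order dimensions by decreasing selectivity, then deal them out,
    # so the most selective dimensions are spread over the indexes.
    order = sorted(range(len(selectivities)),
                   key=lambda i: selectivities[i], reverse=True)
    return [order[i::k] for i in range(k)]

print(round_robin_assignment(8, 3))  # [[0, 3, 6], [1, 4, 7], [2, 5]]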

The experimental evaluation in [19] on synthetic and real world data showed that their approach clearly outperforms both multi-dimensional index structures and the inverted list approach.

2.1.4 Transformation into One Dimensional Space

Another approach is based on the mapping of a d-dimensional data point into a one-dimensional value, which then makes it possible to use an existing one-dimensional index such as the B+-tree. Usually, the performance of the multi-dimensional indexing approach is slightly better; however, as this category makes use of existing and proven technology, it has some advantages, too. The mapping techniques can be implemented much more easily, and important issues such as recovery or concurrency control can be considered solved problems, as the techniques make use of existing B+-tree indexes.

None of the following mappings from a d-dimensional point to a one-dimensional key is bijective. As an implication, we cannot process a given query by only using the one-dimensional keys. Fortunately, one can at least guarantee that there are no false drops. On the other hand, the answer set might contain false hits. Therefore, we have to refine the candidate set by taking the d-dimensional feature vectors into account.

The Pyramid Technique

The Pyramid technique [20] is an index structure that maps a d-dimensional point into a one-dimensional value and uses a B+-tree to index the one-dimensional values. In the data pages of the B+-tree, the Pyramid technique stores both the d-dimensional points and the one-dimensional key. Thus, no inverse transformation is required and the refinement step can be done without any further look-ups. The specific mapping used by the Pyramid technique is called Pyramid-mapping. It is based on a special partitioning strategy that is optimized for range queries on high-dimensional data. The Pyramid technique achieves the partitioning by first dividing the d-dimensional space into 2d pyramids having the center point of the space as their top. In a second step, each single pyramid is cut into slices parallel to the base of the pyramid, forming the data pages.
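
The mapping can be sketched compactly. The following Python code is a simplified rendering of the Pyramid-mapping for a point in the unit hypercube [0,1]^d; boundary cases and the exact formulation in [20] are glossed over.

def pyramid_value(v):
    d = len(v)
    # The pyramid containing v is determined by the dimension along
    # which v deviates most from the center 0.5.
    jmax = max(range(d), key=lambda j: abs(v[j] - 0.5))
    height = abs(v[jmax] - 0.5)
    # Pyramids 0..d-1 lie on the 'low' side, d..2d-1 on the 'high' side.
    i = jmax if v[jmax] < 0.5 else jmax + d
    return i + height  # integer part: pyramid number; fraction: height

print(pyramid_value((0.2, 0.9)))  # pyramid 3 (high side of axis 1), height 0.4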

The Pyramid technique does not support the notion of distance and therefore does not capture metric relationships between data points. To support k nearest neighbor search, iterative range queries must be used, and hence the index may not be efficient for exact k-NN search.

In an extensive performance analysis the authors show that for almost hypercube-shaped queries the Pyramid technique clearly outperforms the X-tree and the sequential scan on synthetic and real-world data [20]. Most importantly, for uniformly distributed feature sets, the performance of the Pyramid technique does not degenerate when the dimension of the feature space increases. However, for queries yielding a low selectivity (i.e. a large answer set) or extremely skewed queries, the sequential scan outperforms the Pyramid technique.

The iMinMax Technique

A new index scheme, called iMinMax, was introduced in [98]. iMinMax maps high-dimensional points to single-dimensional values, dependent on their minimum and maximum coordinate values. As with other dimensionality reduction methods, this scheme is also based on the B+-tree. θ in the iMinMax technique is a tuning parameter which makes it adaptive to data distributions. As such, it performs better than the Pyramid technique [20] when the data is skewed. Since θ is a global parameter, iMinMax works well when the number of natural clusters is small.

Experiments carried out by the authors showed that this method is significantly more efficient than the Pyramid technique [98]. The authors state that the performance difference is expected to increase as the data volume and dimensionality increase, and for skewed data distributions.

The iDistance Technique

In [130], [77] an efficient method called iDistance for k-nearest neighbor search in high-dimensional spaces was presented. iDistance partitions the data and selects a reference point for each partition. The data in each partition are transformed into a single dimensional space based on their similarity with respect to a reference point. This allows the use of a B+-tree for k-NN queries.

The design of iDistance is motivated by the following observations. First, the (dis-)similarity between data points can be derived with reference to a chosen reference point. Second, data points can be ordered based on their distances to a reference point, and the points which are near in high-dimensional space are expected to have similar distances to the reference point. Third, distance is essentially a single value. Hence, high-dimensional data can be represented in single-dimensional space, thereby enabling the reuse of existing single-dimensional indexes such as the B+-tree.
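
These observations translate into a short key formula. The Python sketch below is our own illustration of the mapping: a point is assigned to its nearest reference point O_i and keyed by i * c + dist(p, O_i), where the constant c is assumed to exceed the largest distance occurring inside any partition, so that the key ranges of different partitions do not overlap.

import math

def idistance_key(p, reference_points, c):
    # Find the nearest reference point and map p to one dimension.
    dists = [math.dist(p, o) for o in reference_points]
    i = min(range(len(dists)), key=lambda j: dists[j])
    return i * c + dists[i]  # a single value, indexable by a B+-tree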

With an appropriate choice of partition scheme, iDistance is one of the most efficient access techniques with respect to k-NN search. It is true that the power of pruning methods deteriorates with increasing dimensionality and parameter k, but the deterioration is less dramatic for iDistance. In [130], [77] the authors show the improvement factor of iDistance with increasing data space dimensionality, compared with other techniques including the linear scan and the iMinMax approach.

The P+-Tree

Based on the fact that the Pyramid technique and iMinMax are efficient for window queries, and also that iDistance is superior for k-NN queries, Zhang et al. in [132] present a new structure, called the P+-tree, that supports both window queries and k-NN queries under different workloads efficiently. In the P+-tree, a B+-tree is employed to index the data points as follows. The data space is partitioned into subspaces based on clustering, and points in each subspace are mapped onto a single dimensional space using the Pyramid technique and stored in the B+-tree. The crux of the scheme lies in the transformation of the data, which has two crucial properties. First, it maps each subspace into a hypercube so that the Pyramid technique can be applied. Second, it shifts the cluster center to the top of the pyramid, which is the case in which the Pyramid technique works very efficiently. The authors present algorithms for window and k-NN search.

Through an extensive performance study, they show that the P+-tree achieves a considerable speedup over the Pyramid technique and iMinMax for window queries and outperforms iDistance for k-NN queries [132].

2.1.5 Scan-based Access Methods - Quantization

To reduce the I/O cost of range searching, besides pruning the search space and reducing dimensionality, compression or quantization has also been considered. The basic idea of compression is to use less disk storage and hence incur fewer I/Os. The index itself or a compressed form of representative values acts as a filter to reduce the number of data points that are clearly not in the answer set. For each point that satisfies the search condition, the real data needs to be tested.

The VA-file

The VA-file was presented by Weber et al. [123]. The authors prove in the paper that under certain assumptions, above a certain dimensionality no index structure can process a nearest neighbor query efficiently. Thus, they suggest speeding up the sequential scan. The basic idea of the VA-file is to keep two files: a bit-compressed (quantized) version of the points and the exact representation of the points. Both files are unsorted; however, the ordering of the points in the two files is identical. Query processing is equivalent to a sequential scan of the compressed file with some look-ups to the second file whenever this is necessary.

The VA-file is based on the idea of object approximation, by mapping a coordinate to some value that reduces the storage requirement. The basic idea is to divide the data space into 2^b hyper-rectangular cells, where b is the (tunable) number of bits used for representation. The data space consists of 2^b hyper-rectangular cells, each of which can be represented by a unique bit string of length b. A data point is then approximated by the bit string of the cell that it falls into.
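
The approximation step is easy to sketch. The following Python code is our own simplified illustration: each coordinate of a point in [0,1)^d is quantized onto a grid with 2^b cells per dimension, and the cell numbers are concatenated into one bit string.

def va_signature(p, bits_per_dim):
    sig = 0
    for x, b in zip(p, bits_per_dim):
        # Grid cell of this coordinate, clamped to the last cell.
        cell = min(int(x * (1 << b)), (1 << b) - 1)
        sig = (sig << b) | cell  # append b bits to the signature
    return sig

print(bin(va_signature((0.3, 0.8), (2, 2))))  # cells 1 and 3 -> 0b111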

The main problem of the VA-file is as follows. The precision of the signature affects the size of the query hyper-cube. The generation of the signature of the range query maps a float vector to a small number of bits, where it loses in terms of accuracy. However, the number of bits needs careful tuning, and unfortunately, there are no guidelines for such tuning. The VA-file outperforms both the R*-tree and the X-tree when the dimension is higher than six, but its performance is very sensitive to the actual data distribution [123].



The IQ-tree

The idea of quantization-based compression has also been integrated into index-based query processing. The IQ-tree [17] is a three-level index structure which combines the ideas of a tree, a scan, and quantization. The technique performs an I/O-optimizing scan through the data pages if the index selectivity is not high enough to compensate for the cost of seek operations. In contrast to the VA-file, the quantization grid of the IQ-tree is related to the data page regions and its resolution is automatically optimized during index construction and maintenance.

In the experiments it was shown that the IQ-tree yields a performance that is the 'best of two worlds'. In low- and medium-dimensional feature spaces, the IQ-tree performs comparably to the X-tree and clearly outperforms scan-based approaches like the VA-file. On the other hand, when indexing high-dimensional feature sets, the IQ-tree performs comparably to the VA-file and clearly outperforms the X-tree. Thus, the IQ-tree shows a better overall performance than competing techniques for low-, medium-, and high-dimensional feature spaces.

Techniques like the VA-file [123] and the IQ-tree [17] exploit the fact that the simple sequential scan often provides better query performance than index approaches due to the lack of random disk seeks. The VA-file uses a special compression technique in order to reduce the total amount of data that has to be scanned. The IQ-tree is a sophisticated technique that combines the paradigm of index selectivity and the compression concept of the VA-file. Both techniques are well suited for efficient similarity search in high-dimensional feature spaces.

2.1.6 Metric-based Access Methods

Feature transformation often involves complex transformations of the multimedia objects such as feature extraction, normalization, Fourier transformation or wavelet transformation. For efficient similarity search, feature vectors are usually stored in a high-dimensional index structure and the distance metric is used both to guide the building of the index structure and to search efficiently on the index structure. The basic idea of metric indexes is to use the properties of similarity to build a tree, which can be used to prune branches in processing the queries.

Indexing in a metric space supports fast similarity and nearest neighbor retrieval. However, such indexes do not support range queries, since the data points are not indexed based on individual attribute values. In cases where both kinds of applications are required, a metric-based index and an attribute-based high-dimensional index can both be supported to provide fast retrieval.



The M-tree

In [36] an access method, called the M-tree, is proposed to organize and search large data sets from a generic 'metric space', i.e. where object proximity is only defined by a distance function satisfying positivity and symmetry and fulfilling the triangle inequality. The M-tree extends the domain of applicability beyond traditional vector spaces.

The M-tree [36] is a height-balanced tree that indexes objects based on their relative distances as measured by a specific distance function. While leaf nodes store feature values of objects, internal nodes store routing objects (Or) and corresponding child node pointers. All objects in the subtree (known as the covering tree) are within the distance r(Or) of Or. Ciaccia et al. [36] present preliminary results showing that the M-tree outperforms a well-known vector space data structure (the R*-tree [13]) when applied to a vector space.
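
The covering radius makes subtree pruning a one-line test. The following Python sketch (our own illustration) shows the triangle-inequality argument: every object o below the routing object Or satisfies d(q, o) >= d(q, Or) - r(Or), so a subtree whose lower bound exceeds the query range can be skipped without computing any further distances.

def can_prune(d_q_routing, r_covering, query_range):
    # d(q, o) >= d(q, Or) - r(Or) for every o in the covering tree,
    # so the subtree cannot contribute if the bound exceeds the range.
    return d_q_routing - r_covering > query_range

print(can_prune(d_q_routing=10.0, r_covering=2.5, query_range=4.0))  # True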

The Slim-tree

Another metric-based index is the Slim-tree [80], which is a height-balanced and dynamic tree structure that grows from the leaves to the root. In an index node, the first entry is the entry with the representative object of the current node. Each entry in the index node contains an object, which is the representative of its subtree, the distance of the object to the representative of the current node (region), the covering radius of that region, a pointer to its subtree and the number of entries in its subtree. Each entry of a leaf node contains an object, an object identifier, and the distance of this object to the representative of the current leaf node region. The main idea of the Slim-tree is to organize the objects in a hierarchical tree structure, using representatives as the centers of minimum bounding regions which cover the objects in a subtree. The aim is to reduce the overlap between the regions in each level of the metric tree.

The split algorithm of the Slim-tree is based on the concept of the minimal spanning tree [90]. It distributes the objects by cutting the longest line among all the closest connecting lines between objects. If none exists, an uneven split is accepted as a compromise. Results obtained from experiments with real-world data sets show that the Slim-tree outperforms the M-tree by a factor of 35% [80].

2.1.7 Summary

High-dimensional indexes have been designed based on one of the following approaches to resolving the 'dimensionality curse'. The first approach is to reduce the amount of overlap in the nodes of an R-tree index structure by increasing the fan-out of selected nodes [21], or by using bounding spheres which are more suitable for high-dimensional data and nearest neighbor search [125].

The second approach seeks to reduce the dimensionality of data in one of two ways: (1) by organizing the directory using only a small number of dimensions that are most representative of the data objects (as in the TV-tree [93]), or (2) by transforming data objects from a high-dimensional space into some lower dimensional space (as in the case of Fastmap [47]). Other dimensionality reduction approaches include the application of principal component analysis [79], local dimensionality reduction [28] and transformation into one-dimensional space [20], [77].

The third approach is to index the data points based on a metric to directly support similarity search [36]. The data points are organized based on similarity or distance, in contrast to attribute-value-based indexing. Based on the triangle inequality relationships between data points, data points are retrieved based on their relative distance or similarity with reference to index points. Other approaches include scan-based access methods [123] and quantization [17].

2.2 Skyline Query Processing

Skyline queries are a specific, yet relevant, example of preference queries [33], [84], and have been recognized as a useful and practical way to make database systems more flexible in supporting user requirements [32]. Skyline queries over multi-dimensional data sets have, justifiably, become a topic of intense study in recent years. The set of skyline points presents a scale-free choice of data points worthy of further consideration in many contexts. Informally, the skyline of a multidimensional data set contains the 'best' tuples according to any preference function that is monotonic on each dimension. Furthermore, for every point p in the skyline there exists a monotone function f such that p minimizes f [26].

The skyline query can be traced back to the 1960s in the theory field, where the skyline is called the Pareto set, and the skyline objects are called admissible points [8] or maximal vectors [15]. The corresponding problem in the theory field is known as the maximal vector problem [91], [106]. Several main-memory algorithms [91], [15], [16] have been proposed to solve the maximal vector problem. However, in the database context, those main-memory algorithms are inefficient for the skyline query, due to the large sizes of the data sets.

This section is organized in the following way: Section 2.2.1 reviews generic algorithms for skyline computation which work on top of existing database management systems. In Section 2.2.2, we discuss index-based skyline algorithms. Sections 2.2.3 and 2.2.4 review skyline algorithms in subspaces and in high dimensional data spaces, respectively. Algorithms based on a distributed and/or P2P access model are discussed in Section 2.2.5. Section 2.2.6 reviews skyline algorithms in other environments, such as partially ordered domains and stream monitoring. Finally, in Section 2.2.7 we briefly discuss the aforementioned approaches.


2.2.1 Generic Skyline Algorithms

In this section we review related work that concentrates on how to implement the skyline operator within a database system; in other terms, how can an application running on top of a standard database system evaluate a skyline query? Note that this is precisely the problem that one has to face when using any currently available commercial database system. There are several different (physical) ways to implement the skyline operator, and in this section we review the approaches presented in the literature.

Block Nested Loops

Block Nested Loops (BNL) [26] compares each point of the database with every other point, and reports it as a result only if it is not dominated by any other point. A window W is allocated in main memory, and the input relation is sequentially scanned. In this way, a block of skyline objects is produced in every iteration. In case the window saturates, a temporary file is used to store points that cannot be placed in W. This file is used as the input to the next pass. Eventually the algorithm terminates, since at the end of each pass the size of the temporary file can only decrease.
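
A minimal sketch of this scheme is given below, reusing the dominates helper above. It is a deliberate simplification: overflowing points are kept in an in-memory list instead of a temporary file, and the final cross-check replaces the timestamp bookkeeping that the original algorithm uses to decide when window points may be output.

    def bnl_skyline(points, window_size):
        window, overflow = [], []
        for p in points:
            if any(dominates(w, p) for w in window):
                continue                  # p is dominated, discard it
            window = [w for w in window if not dominates(p, w)]
            if len(window) < window_size:
                window.append(p)          # p survives in the window
            else:
                overflow.append(p)        # window saturated, defer p
        if not overflow:
            return window
        rest = bnl_skyline(overflow, window_size)   # next pass
        # deferred points and late window entries were never compared
        window = [w for w in window
                  if not any(dominates(r, w) for r in rest)]
        rest = [r for r in rest
                if not any(dominates(w, r) for w in window)]
        return window + rest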

Divide and Conquer

Divide and Conquer (D&C) [26] divides the data space into several regions, calculates the skyline in each region, and produces the final skyline from the points in the regional skylines. However, a recent analysis [56] shows that the average performance of D&C deteriorates with increasing skyline dimensionality d. Both BNL and D&C extend their main-memory counterparts by taking the memory size into consideration.

Sort Filter Skyline

BNL makes many unnecessary comparisons between objects that are not in the skyline. When BNL is given a larger main-memory allocation (and thus a larger window), its performance deteriorates, because any skyline point is necessarily compared against all the points in the window. To eliminate those comparisons, Sort Filter Skyline (SFS) [34], [35], which is based on the same principle as BNL, improves performance by first sorting the data according to a monotone function. The skyline objects are output to a window. If the window is large enough, each object is compared only with the skyline objects, and objects put into the window are guaranteed to be in the skyline. Otherwise, some objects are put into a temporary file, as in BNL. An important property of SFS is that it always takes the minimum number of iterations.
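
Assuming the window never overflows, the core of SFS reduces to the following sketch; sorting by the sum of coordinates is just one possible monotone function:

    def sfs_skyline(points):
        window = []
        # after sorting by a monotone function, no later point can
        # dominate an earlier one, so every point that survives the
        # dominance test is immediately a skyline point
        for p in sorted(points, key=sum):
            if not any(dominates(w, p) for w in window):
                window.append(p)
        return window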


Experiments in [34] have shown that SFS provides a distinct improvement over BNL. Furthermore, it was shown in [34] that BNL is ill-behaved relationally: if the data set is already ordered in some way (but not for the benefit of finding the skyline), BNL can perform very badly.

Linear Elimination Sort for Skyline

Godfrey et al. [56] proposed another generic algorithm called Linear Elimination Sort for Skyline (LESS), aimed at improving the asymptotic complexity. LESS also requires the data to be pre-sorted, while integrating into the first step of a standard external sort-merge algorithm an elimination-filter window, so as to discard some dominated tuples earlier. Further, LESS combines the last merge pass of the sorting algorithm with the first skyline-filter pass. LESS achieves O(d * N) average-case cost, where d is the dimensionality and N is the cardinality. LESS has all of SFS's benefits with no additional disadvantages. Moreover, LESS saves a pass, since it combines the last sort pass with the first skyline pass. LESS should therefore consistently perform better than SFS.

Although the results in [56] show that LESS consistently outperforms SFS, the difference between LESS and SFS closes as the dimensionality increases. This is because, for higher dimensionality, more time is spent by both LESS and SFS in the skyline-filtering phase, simply due to the fact that more records are skyline points. Nevertheless, LESS is currently the best server-side algorithm when no indexes are available.

Sort and Limit Skyline Algorithm

Based on the observation that LESS is not applicable in scenarios in which one has no direct control over the data server (and thus over the algorithm used to sort tuples), Bartolini et al. [11] introduce a novel algorithm called SaLSa (Sort and Limit Skyline algorithm). SaLSa takes from the SFS algorithm by Chomicki et al. [34] the idea of pre-sorting the input relation before running the filter step in which dominance tests are executed. Thus, both SFS and SaLSa need to perform a topological sort of the input data, and as such both can progressively return undominated tuples as soon as they discover them. However, while for SFS a "good" sorting function is one that puts in the first positions those tuples that are likely to dominate many other tuples, thus reducing the number of dominance tests, sorting the data in SaLSa is mainly used as a means to stop fetching tuples from the input stream, thus effectively limiting the number of tuples to be read. In other terms, SaLSa relies on sorting functions that can guarantee that all tuples beyond a certain point in the input stream are dominated by some already seen tuple. Three specific symmetric sorting functions were considered, namely volume (vol), sum of coordinates (sum), and maximum coordinate (max), and the latter was proved to have an optimal limiting performance.


Experiments in [11] show that SaLSa achieves a better running time than SFS, in terms of reducing the number of tuples to be fetched from the data server.

2.2.2 Index-based Skyline Algorithms

Due to the poor performance of the above methods for computing the skyline, the techniques presented in this subsection make use of an index structure in order to prune the search space and speed up the computation. These index-based techniques have some desirable properties, such as progressive (or online) behavior. Progressive processing of the skyline means that skyline points are returned immediately after they have been found, without the need to wait until all points have been examined.

Bitmap

Tan et al. [116], [44] present the first index-based progressive technique, namely the Bitmap method. Bitmap maps each object to a bit string, and the skyline is computed using efficient bit operations. While all previous algorithms had to look at the points in the data set at least once to return the first skyline point (most algorithms required multiple passes over the data set), Bitmap requires exactly one pass over the data set to compute all skyline points. This progressive algorithm uses bitmaps to encode all the information needed to decide whether a point is part of the skyline or not. A point p with d dimensions is mapped to a bit vector. This bit vector holds information about the rank of each value p1, p2, ..., pd compared to other values of the same dimension. The length of the bit vector is determined by the number of distinct values over all dimensions. The Bitmap algorithm can return skyline points by scanning the whole data set once. The efficiency of the algorithm relies entirely on the speed of bitwise operations.

The Bitmap algorithm has two drawbacks, both rooted in its use of bitmaps and bit-slices to determine whether a point is in the skyline. First, for each point inspected, the algorithm must retrieve the bitmaps of all dimensions of all points in order to get the bit-slices. Second, if the number of distinct values is large, the space consumption of the bitmaps may be prohibitive. The Bitmap algorithm can therefore only be applied if the number of distinct values is small.

Index

Another index-based progressive technique, called the Index method, which can output the skyline without having to scan the whole dataset, is introduced in [116] and [44]. Index exploits a transformation mechanism and a B+-tree index to return skyline points in batches. Essentially, each point is transformed into a single-dimensional space and stored in a B+-tree structure. Moreover, points with some common features (the same mapping value) are clustered together. The sort order in the transformed space allows points that are likely candidates to be skyline points to be examined first. Additionally, it also allows the pruning of points that are clearly dominated by some other points.

Index organizes the data into d lists, where the i-th list (1 <= i <= d) contains points whose coordinate on the i-th axis is the smallest among all the dimensions. Index scans the d lists sequentially and simultaneously from the first entries. An object appears in the i-th list if its i-th coordinate value is the minimum among all its dimensions. The i-th list is sorted in ascending order of the i-th coordinate value. If the current unexamined object in a list has a key value larger than the maximum coordinate value of some already found object, the remainder of the list can be pruned.
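
A sketch of the list construction and of the pruning test (with names of our own choosing, assuming minimization and that all dimensions share the same domain) may help:

    def build_lists(points, d):
        # point p is placed in the list of its minimum coordinate;
        # each list is then sorted on that coordinate
        lists = [[] for _ in range(d)]
        for p in points:
            i = min(range(d), key=lambda j: p[j])
            lists[i].append(p)
        for i in range(d):
            lists[i].sort(key=lambda p: p[i])
        return lists

    def can_prune_rest(lst, pos, i, best_minmax):
        # best_minmax is the smallest maximum coordinate over all
        # objects found so far; every remaining entry of list i has
        # all coordinates at least its key (the key is the entry's
        # minimum coordinate), hence it is dominated
        return lst[pos][i] > best_minmax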

Experiments carried out in [116] show that the Index method is superior to the non-index methods (BNL and D&C) in terms of overall runtime. Moreover, the Bitmap scheme performs well when the number of dimensions is small.

NN Algorithm

Kossmann et al. [89] presented a skyline computation algorithm based on nearest neighbor (NN) search on the indexed dataset. NN uses the results of nearest-neighbor search to partition the data space recursively. The dataset is indexed by an R-tree (Guttman [62], Sellis et al. [110], Beckmann et al. [13]). The NN algorithm performs a nearest-neighbor query (using an existing algorithm, such as one of those proposed by Roussopoulos et al. [107] or by Hjaltason and Samet [68]) on the R-tree to find the point with the minimum distance (mindist) from the origin. It can be shown that the first nearest neighbor is part of the skyline [89]. On the other hand, all the points in the dominance region of the first nearest neighbor can be pruned from further consideration.

Fig. 2.5. Discovery of points i and a [89]


The partitions resulting after the discovery of a skyline point are inserted into a to-do list. While the to-do list is not empty, NN removes one of the partitions from the list and recursively repeats the same process. If a partition is empty, it is not subdivided further. In general, if d is the dimensionality of the data space, a new skyline point causes d recursive applications of NN. In particular, each coordinate of the discovered point splits the corresponding axis, introducing a new search region towards the origin of the axis. For d > 2, the overlapping of the partitions necessitates duplicate elimination.
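
The recursion can be sketched as follows; nearest_neighbor(region) is a hypothetical stand-in for the R-tree based NN search, returning the point inside the region closest to the origin, or None if the region is empty:

    def nn_skyline(nearest_neighbor, d, full_region):
        # a region is a list of (low, high) intervals, one per axis
        skyline, todo = [], [full_region]
        while todo:
            region = todo.pop()
            p = nearest_neighbor(region)
            if p is None:
                continue                 # empty partition
            if p not in skyline:
                skyline.append(p)        # the first NN is a skyline point
            for i in range(d):
                low, high = region[i]
                if low < p[i]:
                    # keep only the part of axis i towards the origin;
                    # for d > 2 these partitions overlap, hence the
                    # duplicate check above
                    todo.append(region[:i] + [(low, p[i])] + region[i+1:])
        return skyline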

Kossmann et al. in [89] introduced several criteria that an algorithm must meet to be characterized as online, so that the user quickly gets a good big picture of the skyline. In particular, the algorithm should be fair and not favor points that are particularly good in one dimension. In this context, the Index algorithm [116] fails to meet these requirements, as it first returns extreme points, which are good in only one dimension.

Branch and Bound Skyline

Papadias et al. [101], [102] propose the Branch and Bound Skyline (BBS) algorithm to progressively output the skyline points of a dataset indexed by an R-tree. The difference to the NN algorithm [89] discussed before is that NN issues multiple nearest neighbor queries, while BBS traverses the index only once. One of the most important properties of this technique is that it guarantees the minimum I/O cost for a given R-tree, i.e., it accesses fewer disk pages than any algorithm based on R-trees (including NN).

BBS processes the (leaf/intermediate) entries in ascending order of their mindist (minimum distance) to the origin of the data space. At the beginning, the root entries are inserted into a heap using their mindist as the sorting key. Then, the algorithm repeatedly removes the top entry from the heap, accesses its child node, and inserts all of that node's entries into the heap.

BBS maintains an additional main-memory R-tree for dominance tests. Whenever an intermediate/leaf entry is removed from the heap, it is tested for dominance using the main-memory R-tree. A non-dominated point is used for pruning and is inserted into the main-memory R-tree. Extensive experiments in [101] have shown that BBS outperforms NN by several orders of magnitude. BBS is considered the best centralized method for skyline computation.
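
The following sketch captures the traversal; the callbacks mindist, children, is_point and lower_left are hypothetical stand-ins for the R-tree interface, point entries are assumed to be plain coordinate tuples, and the linear scan over the skyline replaces the main-memory R-tree of the original work. The dominates helper from above is reused.

    import heapq, itertools

    def bbs_skyline(root, mindist, children, is_point, lower_left):
        skyline = []
        tick = itertools.count()         # tie-breaker for the heap
        heap = [(mindist(e), next(tick), e) for e in children(root)]
        heapq.heapify(heap)
        while heap:
            _, _, e = heapq.heappop(heap)
            # an entry is pruned if its lower-left corner (the entry
            # itself, for points) is dominated by a known skyline point
            if any(dominates(s, lower_left(e)) for s in skyline):
                continue
            if is_point(e):
                skyline.append(e)        # progressive output
            else:
                for c in children(e):    # expand intermediate entry
                    heapq.heappush(heap, (mindist(c), next(tick), c))
        return skyline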

2.2.3 Subspace Skyline Algorithms

A more general scenario, as opposed to assuming that the query dimensions are fixed, is the case where the query dimensions can vary. Given a set of objects with d dimensions, a skyline query can be issued on any subset k (k < d) of the d dimensions. The subset of the d dimensions of interest is called a subspace, and the corresponding skyline query is called a subspace skyline query. Full-space skyline approaches are optimized for a fixed set of dimensions and thus are not efficient for the general case.


The Skyline Cube

Recently, Pei et al. [105] and Yuan et al. [131] independently proposed the Skyline Cube (SKYCUBE), which consists of the skylines in all possible subspaces. Similar to the idea of the data cube [58], [59] in data warehousing, the SKYCUBE consists of the skylines of the 2^d subsets of the d dimensions. These papers propose efficient methods for finding skyline points in subspaces by exploiting various sharing strategies, such as bottom-up and top-down methods, for the initial computation of the SKYCUBE.

Pei et al. [105] discussed subspace skylines primarily from the view of query semantics. They solved the skyline membership query (why, and in which subspaces, is an object in the skyline) by using the notion of a skyline group. With the complete SKYCUBE, a subspace skyline query can be answered with little overhead.

The Compressed Skyline Cube

Recently, Xia and Zhang [128] addressed the update support of the SKYCUBE. The SKYCUBE, as mentioned before, pre-computes all subspace skylines and provides fast responses at query time. In order to deal with a dynamic environment where objects are updated frequently, a new structure, called the Compressed Skyline Cube (CSC), and an incremental and scalable update scheme are proposed. CSC concisely preserves the essential information of all subspace skylines without compromising query efficiency. The proposed update scheme for CSC is object-aware, such that updates of different objects trigger different amounts of computation.

More recently, Pei et al. [104] proposed an efficient method, Stellar, which exploits an interesting skyline group lattice on a small subset of objects, namely those in the skyline of the full space. The authors show that this skyline group lattice is easy to compute and can be extended to the skyline group lattice on all objects. After computing the skyline in the full space, Stellar only needs to enumerate the skyline groups and their decisive subspaces using the full-space skyline objects. Stellar is shown to be superior to CSC [104].

The SUBSKY Algorithm

SUBSKY, recently proposed in [119], is an index-based method to compute skylines in low-dimensional subspaces, while the total dimensionality may be high. This method transforms the multi-dimensional data into one-dimensional values and indexes the dataset with a B+-tree. Based on the data distribution, SUBSKY creates an anchor point for each cluster and builds a B+-tree on the L-infinity distance between each object and its corresponding anchor.

A drawback of SUBSKY is the fact that the anchor points are never modified after the initial computation. Since the pruning power is largely determined by the choice of the anchor points, SUBSKY is not suitable for dynamic data where the data distribution may change over time. Furthermore, SUBSKY's pruning ability deteriorates quickly as the query dimensionality increases, which makes it inappropriate for computing skylines in arbitrary large subspaces. Another drawback is that the algorithm does not deliver the skyline points progressively.

The DADACUBE

Li et al. in [92] define a novel data cube called DADA (Data Cube for Dominant Relationship Analysis), which captures the dominant relationships between products and customers, to help firms delineate market opportunities based on customer preferences and competitive products. The concept of dominance is extended for business analysis from a microeconomic perspective. More specifically, a new form of analysis is proposed, called Dominant Relationship Analysis (DRA), which aims to provide insight into the dominant relationships between products and potential buyers. Three new classes of skyline queries, called Dominant Relationship Queries (DRQs), are consequently proposed for analysis purposes: (1) Linear Optimization Queries (LOQ), (2) Subspace Analysis Queries (SAQ), and (3) Comparative Dominant Queries (CDQ). Efficient computation for such queries is achieved through a novel data structure.

2.2.4 High-Dimensional Skylines

In a high-dimensional space, skyline points no longer offer any interesting insights, as there are too many of them. More particularly, as the dimensionality of the data set grows, the skyline operator begins to lose its discriminating power and returns a large fraction of the data. This is because, with increasing dimensionality, the chance of one point dominating another becomes low. As a result, the skyline points become too numerous to provide any helpful information to the user. Recently, there has been work on identifying interesting skylines to address the problem of having too many skyline points, particularly when the data is high-dimensional.

Top-k Frequent Skylines

Chan et al. [31] studied finding the top-k frequent skyline points in multidimensional subspaces. The authors introduce the concept of the skyline frequency of a data point, which compares and ranks the interestingness of data points based on how often they are returned in the skyline when different numbers of dimensions (i.e., subspaces) are considered. Intuitively, a point with a high skyline frequency is more interesting, as it is dominated on fewer combinations of the dimensions. Approximate algorithms are developed to address the issues involved in high-dimensional spaces, namely the dimensionality curse and the fact that frequent skyline computation requires skylines to be computed for each of an exponential number of subsets of the dimensions.


k-dominant Skylines

Chan et al. [31] considered the k-dominance relation and explored k-dominant skylines, a generalization of the skyline concept. A point p is said to k-dominate another point q if there are k dimensions in which p is better than or equal to q, and p is better than q in at least one of these dimensions. Three algorithms are proposed to solve the k-dominant skyline problem. The first algorithm, the One-Scan Algorithm (OSA), uses the property that a k-dominant skyline point cannot be worse than any conventional skyline point on more than k dimensions. This algorithm maintains the conventional skyline points in a buffer during a scan of the data set and uses them to prune away points that are k-dominated. As the whole set of conventional skyline points can be large, the authors also proposed the Two-Scan Algorithm (TSA). In the first scan, a candidate set of dominant skyline points is retrieved by comparing every point with a set of candidates. The second scan verifies whether these points are truly dominant skyline points. This method turns out to be much more efficient than the one-scan method, and a theoretical analysis is provided to explain its superiority. The third algorithm, the Sorted Retrieval Algorithm (SRA), is motivated by the rank aggregation algorithm proposed by Fagin et al. [45], which pre-sorts the data points separately according to each dimension and then "merges" these ranked lists. In addition, the authors define the top-delta dominant skyline query and the weighted (dominant) skyline query, and show how the three algorithms for the k-dominant skyline problem can be extended to address these problems.
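
The k-dominance test itself is a direct relaxation of the earlier dominates helper (again assuming minimization):

    def k_dominates(p, q, k):
        # p k-dominates q if p is no worse than q on at least k
        # dimensions and strictly better on at least one of them
        no_worse = [i for i in range(len(p)) if p[i] <= q[i]]
        return len(no_worse) >= k and any(p[i] < q[i] for i in no_worse)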

k-most Representative Skylines

Lin et al. [95] study the problem of selecting the k skyline points that maximize the number of points dominated by at least one of them. An exact algorithm based on dynamic programming is proposed for the 2-dimensional case. The authors show that the problem is NP-hard for dimensionality 3 or more, and that it can be approximately solved by a polynomial-time algorithm with guaranteed approximation ratio 1 - 1/e. To speed up the computation, an efficient, scalable, index-based randomized algorithm is developed by applying probabilistic counting techniques.

2.2.5 Distributed and P2P Skyline Algorithms

In this subsection we review skyline algorithms based on a distributed access model [7], [5], [9], [10]. Such algorithms work by querying d independent subsystems, each managing a specific skyline attribute and returning objects ordered according to the preference on that attribute (e.g., minimize the price). By iterating on the streams of incoming results, the skyline can be computed by looking only at those objects that are returned by at least one subsystem before a single object p has been returned by all subsystems.


To find the top objects under some monotone aggregation function, Fagin et al. proposed three methods in [45], FA, TA and NRA, which are optimal in some cases. The FA algorithm accesses in parallel the sorted lists of every dimension; the top-k points are found once there is a set of at least k points such that each of them has been seen in every list. The TA algorithm improves on FA by maintaining a threshold, computed by applying the aggregation function to the smallest values seen so far in all the lists; the algorithm stops when the current top-k points all have an aggregation value larger than the threshold. The NRA algorithm works with sorted access only. The smallest values seen in all dimension lists are used to calculate an upper bound for the points not yet seen, so the top-k result can be reported without exact aggregation values as soon as the lower bounds of the current top-k points are larger than the upper bound of the (k + 1)-th point.
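
A compact sketch of TA under these conventions (descending lists, higher aggregate is better) follows; the list layout and helper names are our own assumptions, and every object is assumed to appear exactly once in every list:

    import heapq

    def threshold_algorithm(lists, get_scores, k, agg=sum):
        # lists: one per dimension, each a sequence of (point_id, score)
        # pairs sorted in descending order of score; get_scores(pid) is
        # the random access returning all scores of a point
        seen, top = set(), []            # top: min-heap of (aggregate, pid)
        for depth in range(len(lists[0])):
            for l in lists:
                pid, _ = l[depth]
                if pid not in seen:
                    seen.add(pid)
                    heapq.heappush(top, (agg(get_scores(pid)), pid))
                    if len(top) > k:
                        heapq.heappop(top)   # keep only the k best so far
            # the aggregate of the values at the current depth bounds
            # every object not yet seen under sorted access
            threshold = agg([l[depth][1] for l in lists])
            if len(top) == k and top[0][0] >= threshold:
                break                    # no unseen point can do better
        return sorted(top, reverse=True)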

To perform all the necessary dominance tests, and thus to eventually determine the actual skyline, the missing attribute values of all candidate objects are then obtained through a series of random accesses, each having as input the object identifier for which attribute values are sought.

Distributed Skyline Algorithm

The first algorithm for distributed environments, called the Distributed Skyline Algorithm (DSA), was proposed in [7] and enhanced in [5]. These algorithms assume that each Web source provides a globally ordered score list, which can be accessed by sorted and random access. In these approaches, the data is vertically distributed across different web information services, with each site providing one attribute of the data objects. Skyline points are calculated per site and reported to users at a central point. The authors show how to efficiently perform distributed skyline queries and thus essentially extend the expressiveness of querying current Web information systems. They also propose a sampling scheme that allows getting an early impression of the skyline for subsequent query refinement.

Skyline Algorithms in PDMS

Algorithms based on the distributed access model discussed above are hardly applicable to non-specialized P2P systems [70], [127], [121], [122], since those systems cannot provide lists in globally sorted order. Such a need for global knowledge is the main problem of most existing strategies. In centralized database systems, global knowledge can be exploited for query processing; in such systems the query result is always complete and correct with respect to the given data, since all data has been considered. In dynamic environments like P2P systems, flooding the network is too expensive. Since such systems usually do not provide any global knowledge, routing indexes are used to route queries to only those peers that are most likely to contribute to the final result.


The first paper that focuses on P2P skyline computation is [69]. The authors focus on Peer Data Management Systems (PDMS), where each peer provides its own data with its own schema. Their techniques provide probabilistic guarantees for the result's correctness. The authors extend their work in [70] by proposing multidimensional routing indexes for PDMS based on the Q-tree, a fusion of R-trees and histograms.

Hose et al. in a recent work [71] proposed strategies for processing skyline queries in distributed environments using a subclass of routing indexes called distributed data summaries. Using such structures reduces query execution costs by efficiently routing the query to only those peers that provide relevant data with respect to a given query. The exact skyline algorithm uses distributed data summaries to avoid expensive flooding. In order to further reduce the network load, the authors propose computing a relaxed skyline, which makes it possible to significantly reduce query execution costs at the expense of result quality. The idea was inspired by the concept of thick skylines [78], which contain not only the actual skyline points but also all points within a certain distance around them. Quite contrary to this, a relaxed skyline allows a skyline point to be represented by any point within a predefined distance. In conjunction with distributed data summaries, processing such queries is efficient and still provides the user with the big picture. The maximum relaxation, i.e., the maximum distance between a representative data item and the worst point of the represented region, is defined by the user when issuing the query, so the user himself specifies the maximum relaxation he is willing to accept. Furthermore, a tree-based structure for summarizing data as a basis for such routing indexes is proposed.

DSL

Wu et al. [127] were the first to address the problem of parallelizing progressive skyline queries on a shared-nothing architecture, exploiting partial orders over data partitions. The authors proposed an algorithm called Distributed SkyLine (DSL) that considers distributed processing of skyline queries in structured overlay networks (CAN). Towards this goal, they proposed recursive region partitioning and dynamic region encoding to implement this partial order for pipelining machines during query execution. Their techniques enforce the skyline partial order, so that the system pipelines participating machines during query execution and minimizes inter-machine communication. In contrast to PDMS, these systems provide information about data location.

The authors in [53] study the problem of parallelizing progressive skyline queries in a multi-disk environment. Two parallel algorithms for skyline queries are proposed, called the Basic Parallel Skyline (BPS) and the Improved Parallel Skyline (IPS) algorithms, using the parallel R-tree as the underlying structure. Furthermore, IPS achieves optimal I/O (i.e., number of node accesses) and enables several effective pruning strategies to discard non-qualifying entries during the execution of the algorithm.

SKYPEER

Vlachou et al. [121] address the efficient computation of subspace skyline queries in large-scale peer-to-peer (P2P) networks, where the dataset is horizontally distributed across the peers. Relying on a super-peer architecture, the authors proposed a threshold-based algorithm, called SKYPEER, which forwards skyline query requests among peers in such a way that the amount of transferred data is significantly reduced. The notion of domination is extended by defining the extended skyline set, which contains all data elements that are necessary to answer a skyline query in any arbitrary subspace. In addition, the authors prove that the SKYPEER algorithm provides exact answers and present optimizations to reduce the communication cost and execution time.

SSP

Based on a balanced tree-structured P2P network, the authors of [122] proposed a skyline processing algorithm called Skyline Search Space Partitioning (SSP), which partitions the skyline space adaptively in order to control the query forwarding behavior effectively. By estimating the peer nodes within the query subspaces, the amount of query forwarding can be controlled, limiting the number of peers involved and the number of messages transmitted in the network. Consequently, this technique significantly reduces the number of visited nodes and search messages. In addition, load balancing is achieved by query-load-conscious data space splitting/merging during the joining/departure of nodes and through dynamic load migration.

2.2.6 Skyline Algorithms in other Environments

Besides skyline computation in centralized or distributed systems, there is also a long research stream concerning skyline computation in environments such as data streams [94], spatial databases [112], moving object databases [74], and skyline query processing in partially-ordered (categorical) attribute domains [30]. These approaches propose algorithms which are suitable for such applications. In the remainder of this section we look at some of the approaches proposed in the literature.

Partially-ordered Domains

Chan et al. [29], [30] study partially-ordered domains and extend the skyline query to categorical attribute domains where a total order may not exist. The solution proposed in [30] is to transform each partially-ordered attribute into a two-integer domain that allows the exploitation of index-based algorithms to compute skyline queries on the transformed space. Based on this framework, the authors propose three novel algorithms: BBS+, a straightforward adaptation of BBS using the framework, and SDC (Stratification by Dominance Classification) and SDC+, which are optimized to handle false positives and support progressive evaluation. Both SDC and SDC+ exploit a dominance relationship to organize the data into strata. While SDC generates its strata at runtime, SDC+ partitions the data into strata offline. Two dominance classification strategies (MinPC and MaxPC) that further optimize the performance of SDC and SDC+ are also proposed.

In [6] the authors analyze how far the strict Pareto semantics can be relaxed while always retaining transitivity of the induced Pareto aggregation. They show how the Pareto semantics can be modified to benefit from indifferences: skyline result sizes can be essentially reduced by allowing the user to declare some incomparable values as equally desirable. A major problem of adding such equivalences is that they may result in intransitivity of the aggregated Pareto order, and thus efficient query processing is hampered.

Data Stream Monitoring

Lin et al. [94] investigate continuous skyline monitoring over data streams. The authors focus on the problem of efficiently computing the skyline against the most recent N elements of a data stream seen so far. Specifically, the n-of-N skyline query problem is studied, that is, computing the skyline for the most recent n (for all n <= N) elements. First, an effective pruning technique to minimize the number of elements to be kept is developed. The authors show that, on average, storing only O((log N)^d) elements from the most recent N elements is sufficient to support the precise computation of all n-of-N skyline queries in a d-dimensional space if the data distribution on each dimension is independent. Second, a novel encoding scheme is proposed, together with efficient update techniques, for the stored elements, so that computing an n-of-N skyline query in a d-dimensional space takes O(log N + s) time, which is reduced to O(d * log log N + s) if the data distribution is independent, where s is the number of skyline points. Third, a novel trigger-based technique is provided to process continuous n-of-N skyline queries with O(delta) time to update the current result per new data element and O(log s) time to update the trigger list per result change, where delta is the number of element changes from the current result to the new result. Finally, the proposed techniques are extended to compute the skyline against an arbitrary window in the most recent N elements. Extensive experiments demonstrate that the new techniques can support online skyline query computation over very rapid data streams.

Tao and Papadias [117] study skyline computation in stream environments, where query processing takes into account only a "sliding window" covering the most recent tuples. Two algorithmic frameworks (called the Lazy Method and the Eager Method) for continuously monitoring skyline changes over stream data are proposed, based on several interesting characteristics of the problem. The proposed techniques utilize several properties of stream skylines to improve space/time efficiency by expunging data from the system as early as possible (i.e., before their expiration). Furthermore, the asymptotic performance of the proposed solutions is analyzed.

More recently, Mouratidis et al. [96] studied continuous monitoring of top-k queries over a fixed-size window W (a number of active tuples or time units) of the most recent data, where the data in W reside in main memory. A general methodology for top-k monitoring is proposed that restricts processing to those sub-domains of the workspace that influence the result of some query. The valid records are indexed by a grid structure, which also maintains book-keeping information. Two processing techniques are proposed: the first one, called the Top-k Monitoring Algorithm (TMA), computes the new answer of a query whenever some of the current top-k points expire; the second one, termed the Skyband Monitoring Algorithm (SMA), partially pre-computes the future changes in the result, achieving better running time at the expense of slightly higher space requirements.

Spatial and Spatio-temporal Environments

Skyline queries can be either absolute or relative, where 'absolute' means that minimization is based on the static attribute values of the data points in P, while 'relative' means that the difference between a data point in P and a user-given query point needs to be computed for minimization. The relative skyline query is also known as the Dynamic Skyline Query (DSQ) [101], [102]. The best-known skyline algorithm that can answer dynamic skyline queries is the Branch-and-Bound Skyline (BBS) algorithm [101], a progressive, optimal algorithm for the conventional skyline query. In the setting of BBS, a dynamic skyline query specifies a new n-dimensional space based on the original d-dimensional data space. First, each point p in the database is mapped to the point p' = (f1(p), ..., fn(p)), where each fi is a function of the coordinates of p. Then, the query returns the general (i.e., static) skyline of the new space (the corresponding points in the original space).

The Spatial Skyline Query (SSQ) can be defined as a special case of the dynamic skyline query: given the query set Q, fi = dist(p, qi) is used to map each point p to p'. The concept of spatial skyline queries was first introduced by Sharifzadeh and Shahabi in [112], where the geometric properties of the solution to these queries are examined. Given a set of data points P and a set of query points Q, each data point has a number of derived spatial attributes, each of which is the point's distance to a query point. An SSQ retrieves those points of P which are not dominated by any other point in P considering their derived spatial attributes. The main difference from the regular skyline query is that this spatial domination depends on the location of the query points Q. Two algorithms for spatial skyline queries considering static query points were proposed. A variation of the spatial skyline problem for moving query points is also studied, and the authors show how all three proposed algorithms can handle spatial skyline queries mixed with non-spatial attributes. The authors exploit geometric concepts of the space to develop their algorithms and also propose a Voronoi-based technique which can efficiently handle continuous queries.
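
The reduction to a static skyline over derived distance attributes can be sketched directly; this is a naive materialization reusing the dominates helper from Section 2.2.1 (points are assumed to be coordinate tuples; the efficient algorithms of [112] avoid computing all |P| * |Q| distances):

    import math

    def derived_point(p, Q):
        # vector of distances from p to every query point in Q
        return tuple(math.dist(p, q) for q in Q)

    def spatial_skyline(P, Q):
        derived = {p: derived_point(p, Q) for p in P}
        return [p for p in P
                if not any(dominates(derived[o], derived[p])
                           for o in P if o is not p)]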

A Multi-Source Skyline Query (MSSQ) [43] considers several query points at the same time (e.g., finding hotels which are cheap and close to the University, the Botanic Garden and Chinatown). The multi-source skyline query is also known as the spatial skyline query [112]. However, this work concentrates on road networks (i.e., non-Euclidean spaces). The authors propose three algorithms: the Collaborative Expansion algorithm (CE), the Euclidean Distance Constraint algorithm (EDC) and the Lower Bound Constraint algorithm (LBC). CE is a straightforward algorithm using an underlying paradigm that identifies network skyline points by expanding the search space centered around each query point incrementally. EDC is an approach to control the search directions for network skyline points by using Euclidean skyline points as a guide. LBC is based on network nearest neighbor algorithms and uses a novel concept of a path distance lower bound to minimize the cost of network distance computation. In addition, they prove that LBC is an instance-optimal algorithm.

In the context of spatio-temporal applications, Huang and Jensen [72] studied the interesting problem of finding locations of interest which are not dominated with respect to only two attributes: their network distance to a single query location q and the detour distance from q's predefined route through the road network. Their proposed algorithms rely on existing nearest neighbor and range query approaches to find a candidate set. They then apply naive in-memory skyline computation on the candidate set to extract the result.

In a more recent work [73], the authors study skyline computation over mobile devices. A setting with mobile devices communicating via an ad hoc network is assumed, and skyline queries that involve spatial constraints are studied. The authors present techniques that aim to reduce the costs of communication and the execution time on each single device.

In [74], the authors propose a novel kinetic-based data structure and an associated efficient query processing algorithm. In their paper, the authors investigate the spatio-temporal coherence of the continuous skyline evaluation problem and propose an elegant strategy to handle moving objects.

Approximate Skyline Computation

The problem with computing approximate skylines is that, even for uniform data, we cannot probabilistically estimate the shape of the skyline based only on the dataset cardinality N. In fact, it is difficult to predict the actual number of skyline points (as opposed to their order of magnitude). The approximation error corresponds to the difference of the Skyline Search Regions (SSRs) of the two skylines, that is, the area that is dominated by exactly one of the skylines [102].

Koltun and Papadimitriou [85] aim to find a minimum set of points that approximately dominates all data points. Specifically, for a given epsilon (epsilon >= 0), find a subset P' of a given P, with minimum cardinality, such that every p in P is dominated by (1 - epsilon)q for some q in P'. It has been shown that the problem can be solved by a greedy heuristic in a 2-dimensional space, and that it is NP-hard for d = 3 or more. The authors then show that, for epsilon > 0, the problem can be approximately solved with a P' whose cardinality is poly-logarithmic in the size of the values' domain. Note, however, that this technique does not guarantee that the data points in P' are skyline points unless epsilon = 0; when epsilon = 0, all skyline points are returned.
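
The epsilon-relaxed dominance test behind this formulation is a one-liner; weak domination under minimization is assumed here, and for eps = 0 it degenerates to ordinary weak dominance:

    def eps_dominates(q, p, eps):
        # the scaled point (1 - eps) * q is coordinate-wise no worse than p
        return all((1 - eps) * qi <= pi for qi, pi in zip(q, p))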

In [78], the concept of Thick Skyline Queries (TSQ), which recommend to users not only skyline objects but also their nearby neighbors within epsilon-distance, is proposed. Formally, given a d-dimensional database and the skyline set s1, s2, ..., sm, the thick skyline is the set of all of the following objects: (1) the skyline objects, and (2) the non-skyline objects which are epsilon-neighbors of a skyline object. These generalized skyline objects are categorized into three classes: (a) a single skyline object, called an outlying skyline object, (b) a dense skyline object, which is in a set of nearby skyline objects, and (c) a hybrid skyline object, which is in a set consisting of a mixture of nearby skyline objects and non-skyline objects.

Hose et al. [71] proposed the concept of Relaxed Skyline Queries. The idea was inspired by the concept of thick skylines [78], which contain not only the actual skyline points but also all points within a certain distance around them. Quite contrary to this, a relaxed skyline allows a skyline point to be represented by any point within a predefined distance.

Skyline Maintenance

Wu et al. [126] address the problem of efficiently maintaining a materialized skyline view in response to deletions of skyline points. A systematic way to decompose a d-dimensional exclusive dominance region (EDR) into a collection of hyper-rectangles is proposed. The authors show that the number of such hyper-rectangles is O(m^d), where m is the current skyline size. Then, an algorithm called DeltaSky is proposed, which determines whether an intermediate R-tree MBR intersects the EDR without explicitly calculating the EDR itself. This reduces the worst-case complexity of the EDR intersection check from O(m^d) to O(m * d). Thus DeltaSky helps the branch-and-bound skyline algorithm achieve I/O optimality for deletion maintenance by finding only the newly appearing skyline points after a deletion. The authors also discuss how to implement DeltaSky using one extra B-tree. Moreover, two optimization techniques which further reduce the average cost in practice are proposed.


Skyline Cardinality Estimation

The problem of estimating the size of the skyline has been considered in [16] and [55] under the assumption that no two data items have the same value on any attribute and that all attributes are independent. In a more recent work [32], the authors remove these assumptions and present a more robust skyline cardinality estimation formula without the independence assumption. In addition, the authors consider the issues involved in implementing Skyline as an operator in a relational engine. Two physical algorithms are considered, the Block-Nested-Loops (BNL) algorithm and the presorting (SFS) algorithm, and it is shown how these algorithms can be costed. Finally, the techniques were implemented in a prototype of Microsoft SQL Server, showing that skyline queries can significantly benefit from such an implementation.

2.2.7 Summary

All discussed skyline algorithms have their particular advantages and disadvantages. The decision which skyline algorithm to use requires a thorough analysis of the situation in which the algorithm is to be used and a careful consideration of the pros and cons of each algorithm.

Generic approaches compare each point of the database with every other point, or divide the data space into several regions, calculate the skyline in each region, and produce the final skyline from the points in the regional skylines. To improve performance, one alternative is to first sort the data according to a monotone function; the other is to combine the last merge pass of the sorting algorithm with the first skyline-filter pass.

Due to the poor performance of these methods for computing the skyline, other approaches make use of an index structure in order to prune the search space and speed up the computation. One alternative is to use efficient bit operations, or to transform the points into a single-dimensional space and use a B+-tree. The first online approach is a skyline computation algorithm based on nearest neighbor (NN) search on the indexed dataset, which uses the results of nearest-neighbor search to partition the data space recursively. A different method, avoiding the multiple nearest neighbor queries of the NN algorithm, is to traverse the index only once.

A more general scenario, as opposed to the assumption that the query dimensions are fixed, is the case where the query dimensions can vary. The SKYCUBE exploits various sharing strategies, such as bottom-up and top-down methods, for the initial computation of the skyline cube. SUBSKY is an index-based method to compute skylines in low-dimensional subspaces, while the total dimensionality may be high.

Because in a high-dimensional space skyline points no longer offer any interesting insights, as there are too many of them, there has been work on identifying interesting skylines to address the problem of having too many skyline points, particularly when the data is high-dimensional. In distributed and/or P2P environments, the data is vertically distributed across different web information services, with each site providing one attribute of the data objects.

Besides skyline computation in centralized or distributed systems, other approaches concern skyline computation in environments such as data streams and spatial databases. These approaches propose algorithms which are suitable for such applications.

Since all these algorithms and ideas have been published previously, this part should be seen as a concise compendium on skyline algorithms. For a thorough discussion, the reader is referred to the particular papers cited at the beginning of each section.

2.3 Discussion

In this chapter, we have reviewed related work on high-dimensional indexing and the associated query processing techniques. High-dimensional indexing structures were proposed mainly to support range queries and/or k-nearest neighbor queries. The high-dimensional space poses new problems, such as sparsity of data points and low contrast between data points, making the design of a robust and scalable high-dimensional index a difficult task.

For high-dimensional indexing, we face the problem of queries with very low selectivity but large query widths. This characteristic causes many internal nodes as well as leaf nodes of conventional multi-dimensional indexes, such as the R-tree, to be accessed. For instance, the R-tree does not perform well when the number of dimensions is high, since multiple paths may need to be traversed even for small queries. Indeed, it has been shown that a sequential scan is more efficient than the R-tree when the number of dimensions is high [123].

Another difficulty is that the contrast between data points in terms of distance is very low in high-dimensional space. As the dimensionality increases, the difference between the nearest and the farthest data point shrinks greatly, i.e., the distance to the nearest neighbor approaches the distance to the farthest neighbor. In fact, there has been much debate concerning the meaningfulness of similarity in very high-dimensional spaces [22], [66]. The low contrast makes indexing very difficult and also makes computation very expensive.

Common approaches to the design of high-dimensional indexes include the enlargement of index nodes (e.g., the X-tree), packing of data entries by storing less information [108], and dimensionality reduction [93]. In order to support similarity search, a high-dimensional index can be used to index the data points, and efficient algorithms can then be designed to retrieve similar data points based on a given query point. Alternatively, a specialized index that captures the notion of distance between points can be designed for similarity search. Approaches to similarity search include extensions of the R-tree [124], [83]. Due to the nature of the problem, approximate similarity search is widely accepted as a good compromise between accuracy and response time.


Approaches to approximate similarity search include hashing and clustering[54].

Most indexes have their own strengths and weaknesses. Unfortunately, many of them cannot be readily integrated into existing DBMSs. As a matter of fact, it is not easy to introduce a new index into an existing commercial data server, because other components (e.g., the buffer manager, the concurrency control manager) will be affected. Hence, we propose to build our index on top of already existing low-dimensional R-trees, which do not suffer from the 'curse of dimensionality', in order to exploit well-tested concurrency control techniques and relevant processing strategies. This standard multidimensional indexing structure is implemented in commercial DBMSs [82].


Part II

Similarity Query Optimization


3

Nearest Neighbor Search on Vertically Partitioned High Dimensional Data

During the last decades, an increasing number of applications have emerged, such as medical imaging [88], molecular biology [3], multimedia and computer-aided design, in which a large number of high-dimensional data points have to be processed. Instead of exact match queries, these applications require efficient support for similarity queries. Among these similarity queries, the k-nearest neighbor query (k-NN query), which delivers the k nearest points to a query point, is of particular importance for the applications mentioned above.

Different algorithms have been proposed [65], [67], [100] for supporting k-NN queries on multidimensional index structures, like R-trees. Multidimensional indexing has been examined extensively in the past; see [52] for a survey of various techniques. The performance of these algorithms depends highly on the quality of the underlying index. The most serious problem of multidimensional index structures is that they are not able to cope with a high number of dimensions. This disadvantage can be alleviated by applying dimensionality reduction techniques [23]. However, the reduced dimensionality often still remains too high for most common index structures, such as R-trees. After the high effort put into implementing R-trees in commercial database management systems (DBMS), it seems that their application scope is unfortunately quite limited to low-dimensional data. In this chapter, we revisit the problem of employing R-trees (or other multidimensional index structures) for supporting k-NN queries in an iterative fashion, and we present a new approach to indexing multidimensional data that is particularly suitable for efficient incremental processing of nearest neighbor queries. The basic idea is to use index striping, which vertically splits the data space into multiple low-dimensional data spaces. The data of each of these lower-dimensional subspaces is organized using a standard multi-dimensional index structure, which performs well and does not suffer from the 'dimensionality curse'.
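
As a minimal illustration of index striping (the names and the dimension assignment shown are hypothetical; Section 3.4 discusses how the assignment is actually chosen), the vertical split can be sketched as follows:

    def stripe(points, assignment):
        # assignment: n disjoint groups of dimensions, e.g. [[0, 3], [1, 2]];
        # each sub-point would be inserted into its own low-dimensional
        # R-tree, keyed by the tuple identifier (TID) of the full point
        stripes = [[] for _ in assignment]
        for tid, p in enumerate(points):
            for i, dims in enumerate(assignment):
                stripes[i].append((tid, tuple(p[j] for j in dims)))
        return stripes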

The remainder of this chapter is structured in the following way. In the next section, we motivate the need for low-dimensional R-trees in order to cope with NN search in high-dimensional data spaces, and we present our architecture for similarity query processing. In Section 3.2, we present our multi-step nearest neighbor algorithms, which dynamically exploit the information in the internal nodes of the R-trees. Thereafter, in Section 3.3, we provide a cost model relying on a power law, and we propose a formula for computing the appropriate number of indexes. In Section 3.4, we present algorithms for assigning the dimensions to the indexes. In Section 3.5, we present some powerful extensions, such as NN search using external priority queues. The results of our experiments are presented in Section 3.6, and finally, we conclude in Section 3.7.

3.1 Introduction

Nearest neighbor queries (NN queries) are among the most important operations in today's databases, where the notion of similarity has become indispensable. Efficient processing of NN queries should be supported for high-dimensional data, while the query interface should be flexible enough to allow the user to influence the underlying distance metric. These requirements are difficult to satisfy when commonly available multidimensional index structures, such as R-trees [62], are to be employed for processing nearest neighbor queries.

The origin of our work lies in [19], where a quite similar approach has been proposed for supporting range queries on high-dimensional data. Based on a uniform cost model, the authors first present a formula for the optimal number of multi-dimensional indexes. Two techniques are then presented for assigning dimensions to indexes. The first one simply follows a round-robin strategy, whereas the other exploits the selectivities of the different dimensions, such that the dimension with the highest selectivity is assigned to the first index, and so on.

The assignment strategy has some deficiencies which make it inadequate for k-nearest neighbor (k-NN) queries. First, it assumes that dimensions are independent of each other, which may obviously lead to suboptimal assignments. Second, it is not applicable to high-dimensional data and k-NN queries: since dimensions generally have the same domain, k-NN queries are symmetric in all dimensions, and consequently the selectivity is constant for all dimensions. Our approach differs from [19] in that a new dimension assignment strategy is derived from the fractal dimension [14] of the data set. Moreover, instead of intersecting local results, a merge algorithm outputs the results in an iterative fashion.

Our basic architecture is outlined in Figure 3.1. Our fundamental assumption is that high-dimensional real-world data sets are not independently and uniformly distributed. If this assumption does not hold, the problem is not manageable due to the well-known effects in high-dimensional spaces; see for example [23]. The validity of our assumptions allows the transformation of high-dimensional data into a lower-dimensional space. There are, however, two serious problems with this approach. First, the dimensionality of the transformed data might still be too high to manage the data efficiently with an R-tree. Second, this approach is feasible only when the most important dimensions are globally valid, independent of the position of the query point.

Fig. 3.1. Overview of our Architecture (n indexes with local priority queues PQ_1, ..., PQ_n deliver local NN candidates; a global priority queue PQ_Merge merges them, and a hash table HT_TID together with the multidimensional data file MDF is used to produce the next NN result)

In our new approach, we address both of these problems by partitioning the data vertically and then indexing the (low-dimensional) data of each partition separately. Considering Figure 3.1, an incremental nearest neighbor query is processed by running a local incremental query on each of the indexes concurrently. We keep the complete points in a separate multidimensional data file (MDF) on disk, where each point is accessible via a tuple identifier (TID). Additionally, a priority queue PQ_Merge and a hash table HT_TID are necessary for efficient iterative processing of k-NN queries. The key problem that arises from our architecture is to determine an appropriate number of indexes for supporting k-NN queries efficiently. Thereafter, an effective assignment of the dimensions to the indexes is indispensable. These problems are addressed in this chapter. Moreover, we present different iterative algorithms for k-NN queries that employ the different indexes dynamically.

Our approach is independent of the specific index structure. We decided to use R-trees simply because these structures are generally available and perform well in practice. For the sake of simplicity, we assume that the Euclidean metric L_2 is used as the distance function throughout this chapter. For a given query point, we are therefore able to dynamically exploit those R-trees which give the most benefit to our total query results. Moreover, due to the fact that every R-tree is responsible only for a small number of dimensions, their performance is not affected by the curse of dimensionality. It is also worth mentioning that our algorithms are built on top of existing R-tree implementations and can therefore be immediately applied to commercial DBMSs that offer R-trees and iterative nearest neighbor queries.
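To make the architecture concrete, the following minimal sketch (in Java, the language of our implementation) shows how the shared structures of Figure 3.1 could be wired together. It is an illustration only, not the actual implementation; the type IncNNCursor and the fixed-size record layout of the MDF are assumptions made for this sketch.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

final class Candidate {
    final long tid;        // tuple identifier into the MDF
    final double distance; // (partial or full) distance to the query point
    final double[] point;  // null while only a partial point is known
    Candidate(long tid, double distance, double[] point) {
        this.tid = tid; this.distance = distance; this.point = point;
    }
}

final class IndexStripingContext {
    // one local incremental NN cursor per low-dimensional R-tree (Index 1..n)
    final IncNNCursor[] localQueries;
    // PQ_Merge: global queue ordered by distance to the query point
    final PriorityQueue<Candidate> pqMerge =
        new PriorityQueue<>((a, b) -> Double.compare(a.distance, b.distance));
    // HT_TID: remembers which TIDs were already fetched from the MDF
    final Set<Long> htTid = new HashSet<>();
    // MDF: entire d-dimensional points, addressable by TID
    final RandomAccessFile mdf;
    final int d;

    IndexStripingContext(IncNNCursor[] localQueries, RandomAccessFile mdf, int d) {
        this.localQueries = localQueries; this.mdf = mdf; this.d = d;
    }

    // resolve a partial point to the entire point via its TID
    double[] fetchEntirePoint(long tid) throws IOException {
        double[] p = new double[d];
        mdf.seek(tid * d * Double.BYTES);
        for (int i = 0; i < d; i++) p[i] = mdf.readDouble();
        return p;
    }
}

interface IncNNCursor { Candidate next(); } // hypothetical local cursor API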


3.2 Incremental Nearest Neighbor Algorithms

In this section, we present three iterative algorithms for k-NN queries that comply with our architecture. In the following, we assume that there are n indexes. Each of the n indexes is responsible for a lower-dimensional subspace of the original data space. Furthermore, a distance function df_i and a local priority queue PQ_i are associated with the ith index. A global queue PQ_Merge is used for merging the local results of the indexes. Since an index only delivers partial points, i.e., points from the subspaces, the multidimensional data file (MDF) has to be accessed to obtain the entire point. For a complete reference to the symbols used in this chapter, see Table 3.1.

Table 3.1. Overview of symbols

Symbol      Description
d           Dataset dimensionality
n           Number of indexes
N           Cardinality of the dataset
P           Dataset
q           Query point
p, o        Data points
k           Number of nearest neighbors
min_i       Partial threshold distance
A_i         i-th attribute
d_o         Object distance function
d_f         Feature distance function
D_2         Fractal (correlation) dimension
C_eff       Effective data page capacity
MDF         Multidimensional data file
TID         Tuple identifier
PQ_Merge    Global priority queue
PQ_i        Local priority queue
HT_TID      Hash table

Each of the algorithms follows our framework in a different way. The first algorithm is a straightforward adaptation of the classical incremental algorithm [65], [68]. The second algorithm performs similarly to the threshold algorithm (TA) proposed by Fagin et al. [45], but supports arbitrary queries due to the fact that the indexes efficiently compute an appropriate sorting of the data. The advantage of this algorithm is that it runs on top of existing multidimensional index structures, e.g., R-trees, that support an incremental processing of nearest neighbor queries. Finally, the third algorithm is a hybrid of the previous algorithms. It behaves similarly to TA, but additionally exploits the internal entries of the index structure.


3.2.1 Best-Fit

This algorithm is a straightforward adaptation of the classical incremental algorithm [68] to more than one index. During the initialization step of the query, the roots of all n indexes are inserted into PQ_Merge. During an iteration step, the algorithm pops the top element from the queue PQ_Merge and differentiates between the following three cases:

1. If the top element is an index entry, the entry is expanded, i.e., the referenced page is read from the corresponding index and its entries (points) are added to PQ_Merge.

2. If the top element is a partial point, i.e., a point that appears on the top position for the first time, the hash table HT_TID is probed with the corresponding TID. If the probe is successful, we ignore the point. Otherwise, the TID is inserted into the hash table and the entire point is read from the MDF. Then, the distance between the entire point and the query point is calculated. Finally, the entire point is inserted into PQ_Merge.

3. If the top element is an entire point, we deliver the point as the next nearest neighbor to the user.

The iteration step stops when the third case occurs for the first time.

Two issues require a careful discussion. First, we use the distance function df_i when we have to calculate the distance between the query point and a partial point that has been received from the ith index. In a similar way, we calculate the distance between index entries and the query point. Note also that df_i is a lower bound for the global distance function d_f. Second, ignoring a partial point (case 2) does not harm the correctness of the algorithm: this can only occur when the same point has been read previously, which means that the entire point is either already contained in PQ_Merge or has even been delivered as output to the user. Hence, partial points that appear more than once on top can be safely discarded.
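As an illustration, one Best-Fit iteration could look as follows in Java. This is a sketch, not the thesis code: it reuses the Candidate, IncNNCursor and IndexStripingContext types from the architecture sketch above, assumes a hypothetical IndexEntry type for internal R-tree entries, and assumes that PQ_Merge orders both entries and points by their (lower-bounding) distance to the query point.

import java.io.IOException;
import java.util.PriorityQueue;

final class BestFit {
    // one call returns the next nearest neighbor (case 3), or null if exhausted
    static Candidate nextNearestNeighbor(IndexStripingContext ctx,
                                         PriorityQueue<Object> pqMerge,
                                         double[] query) throws IOException {
        while (!pqMerge.isEmpty()) {
            Object top = pqMerge.poll();
            if (top instanceof IndexEntry) {
                // case 1: expand the referenced page of the local R-tree
                for (Object child : ((IndexEntry) top).expand()) pqMerge.add(child);
            } else {
                Candidate c = (Candidate) top;
                if (c.point != null) return c;      // case 3: entire point on top
                if (ctx.htTid.add(c.tid)) {         // case 2: first time on top
                    double[] p = ctx.fetchEntirePoint(c.tid);
                    pqMerge.add(new Candidate(c.tid, euclidean(p, query), p));
                }   // an already-seen TID is safely ignored (see the discussion above)
            }
        }
        return null;
    }

    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }
}

interface IndexEntry { Iterable<Object> expand(); } // hypothetical internal entry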

With respect to the indexes, the I/O performance of the algorithm for retrieving the first k nearest neighbors is equal to the performance of the algorithm of [19] for processing an equivalent window query. The associated window is given by [q_1 - r, q_1 + r] \times \dots \times [q_n - r, q_n + r], where r is the distance between q and the kth nearest neighbor.

3.2.2 TA-Index

Our TA-Index algorithm is similar to approaches that query d independent subsystems, each managing a specific attribute and returning objects ordered by their similarity on that attribute. By iterating over the streams of incoming results, the top-k results can be computed by looking only at those objects that are returned by at least one subsystem before a single object p has been returned by all subsystems.

This algorithm performs similarly to TA [45], but supports arbitrary NN queries due to the fact that the indexes are able to deliver the data in an appropriate order. TA-Index differs from Best-Fit in that a local incremental NN query runs for every index concurrently. The partial points, which are delivered as the results of these local queries, are merged by inserting them into PQ_Merge. Like Best-Fit, partial points that appear on top are replaced by their entire points. In analogy to TA, we keep a partial threshold min_i for the ith index, where min_i is the distance between the last partial result of the ith index and the query point of the NN query. An entire point from PQ_Merge is returned as the next nearest neighbor if its distance is below the global threshold, which is defined as the square root of the sum of the squared partial thresholds. The faster the partial thresholds increase, the higher the chance that the top element of PQ_Merge is returned as a result. Algorithm 1 contains the code of the iteration phase of the algorithm.

Algorithm 1 TA-Index: nextNearestNeighbor()
1:  while \sqrt{min_1^2 + ... + min_n^2} < L_2(q, PQ_Merge.top()) do
2:    ind ← activityScheduler.next();
3:    cand ← incNN(ind).next();
4:    min_ind ← df_ind(q, cand);
5:    if not HT_TID.contains(cand.TID) then
6:      obj ← MDF.get(cand.TID);
7:      PQ_Merge.push(obj);
8:      HT_TID.add(obj.TID);
9:    end if
10: end while
11: return PQ_Merge.pop()

The algorithm has two main-memory data structures. The priority queue PQ_Merge contains all potential nearest neighbors sorted by their object distance to the query point. Note that we use the standard operations top, pop and push for manipulating queues. A hash table is used for monitoring the identifiers of the objects that have already been inserted into PQ_Merge.

The algorithm first checks whether the distance of the top element is sufficiently small. If so, the top element of PQ_Merge is given to the user. Otherwise, we enter the body of the while-loop. At first, we ask the activity scheduler for the index whose local incremental algorithm should be reactivated next. Note that the specific strategy of the scheduler is a parameter of the algorithm; we discuss different scheduling strategies in detail later (Subsection 3.2.5). Then, we get the next partial point from the selected incremental query. We update the corresponding partial threshold and check whether the TID is already in our hash table. If not, the entire object is fetched from the MDF and inserted into PQ_Merge (this requires the calculation of the global distance). Finally, we insert the corresponding TID into the hash table.

The difference to the original TA algorithm [45] is that we consider a special scenario where n data sources (our indexes) deliver the data as ordered streams, while there is one source (the MDF) that solely supports random accesses. The partial thresholds min_1, ..., min_n have a considerable impact on the performance of the algorithm: the faster they increase, the higher the chance that the top element of PQ_Merge is returned as a result. The next algorithm is particularly tuned to exploit the internal index entries to speed up the increase of the partial thresholds.

3.2.3 TA-Index+

This algorithm is an optimized version of TA-Index, as it additionally exploits the internal index entries to speed up the increase of the partial thresholds. The difference to TA-Index is that the incremental NN queries on the local indexes are assumed to deliver not only the data points but also the internal entries that are visited during processing. Note that the distance of the delivered entries and points is continuously increasing. Therefore, the distances to index entries can also be used for updating min_i without threatening correctness. The incremental step of the algorithm is outlined in Algorithm 2.

Algorithm 2 TA-Index+: nextNearestNeighbor()
1:  while \sqrt{min_1^2 + ... + min_n^2} < L_2(q, PQ_Merge.top()) do
2:    ind ← activityScheduler.next();
3:    cand ← incNNPlus(ind).next();
4:    min_ind ← L_{2,ind}(q, cand);
5:    if (cand is a point) and not HT_TID.contains(cand.TID) then
6:      obj ← MDF.get(cand.TID);
7:      PQ_Merge.push(obj);
8:      HT_TID.add(obj.TID);
9:    end if
10: end while
11: return PQ_Merge.pop()

Note that the difference to TA-Index is that a different method is used for the incremental nearest neighbor queries and that the condition of the if-statement additionally checks whether the candidate is a point. The advantage in comparison to TA-Index is that the local thresholds increase faster and, therefore, the cost for producing the global results is reduced.

3.2.4 Analysis and Correctness

We are especially interested in the case of similarity search, where multi-step nearest neighbor algorithms are commonly used because of the complexity of distance functions, such as quadratic form distance functions. As an approximation of the complex similarity object distance function d_o, a feature distance function d_f can be used in a transformed n-dimensional data space. Usually the feature distance function d_f is some metric L_p. Our algorithms are able to handle this case, especially if the dimensionality of the transformed data space is too high to be indexed efficiently with a single index. In this case, the lower-bounding property must hold.

Let O be a set of objects. A feature distance function d_f and an object distance function d_o fulfill the lower-bounding property if d_f underestimates d_o in any case, i.e., the following holds for all objects:

\forall o_1, o_2 \in O : d_f(o_1, o_2) \le d_o(o_1, o_2)

Here, we use the notion of an object distance function d_o even in the case where the object distance function is equal to the feature distance function d_f.
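As a simple worked instance of this property (our own illustration, for the Euclidean case): if d_f is computed over any subset S of the d dimensions, the omitted squared terms are non-negative, so

\[
d_f(o_1, o_2) = \sqrt{\sum_{i \in S} (o_{1,i} - o_{2,i})^2}
\;\le\;
\sqrt{\sum_{i=1}^{d} (o_{1,i} - o_{2,i})^2} = d_o(o_1, o_2).
\]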

The threshold of our multi-step nearest neighbor algorithm guarantees no false dismissals. To decide whether a candidate is a nearest neighbor or not, we must be sure that no new candidate can be generated that has a smaller distance than our nearest candidate. The observation that the partial feature distances increase while the minimum distance decreases leads us to the following lemma.

Lemma 3.1. For any metric L_p, a candidate with object distance d_o belongs to the result set if the following inequality holds:

\sqrt[p]{\sum_i df_i^p} \ge d_{min}    (3.1)

where df_i is the partial feature distance of the last candidate returned by the ith index, i.e., the maximum partial feature distance of the ith index.

Proof. Let o_1 be the nearest candidate, and assume that o_2 is a candidate, found after the inequality was satisfied for o_1, that is closer to the query point than o_1, i.e., d_o(o_1) > d_o(o_2). No index had found o_2 before the inequality held, for otherwise o_2 would be the nearest candidate. This means that df_i(o_2) \ge df_i for every i, so:

\sum_i df_i(o_2) \ge \sum_i df_i \;\Rightarrow\; \sum_i df_i(o_2)^p \ge \sum_i df_i^p \;\Rightarrow\; \sqrt[p]{\sum_i df_i(o_2)^p} \ge \sqrt[p]{\sum_i df_i^p}

and from the termination inequality we conclude:

\sqrt[p]{\sum_i df_i(o_2)^p} \ge d_{min}

However, because of

d_f(o_2) = \sqrt[p]{\sum_i df_i(o_2)^p}

and

d_{min} = d_o(o_1)

we conclude that d_f(o_2) \ge d_o(o_1). On the other hand, the lower-bounding property gives d_o(o_2) \ge d_f(o_2), hence d_o(o_2) \ge d_o(o_1), which contradicts the assumption that o_2 is closer to the query point than o_1.

This lemma guarantees that data objects whose partial feature distance, according to the index from which they were retrieved, is larger than the object distance of the kth (farthest) nearest neighbor are never examined. This property is similar to the property of the optimal multi-step nearest neighbor algorithm [109], defined as r-optimality.

3.2.5 Index Scheduling

In order to improve the performance of the algorithms, it is important to reduce the number of candidates that have to be examined. This number largely depends on how fast the termination inequality (the condition of the while-loop in Algorithm 2) is satisfied. Therefore, we develop effective strategies for computing a so-called activity schedule that determines in which order the indexes are visited. The goal of the activity schedule is a fast enlargement of the sum of the local thresholds min_i. The naive strategy is to use a round-robin schedule. This strategy, however, does not give priority to those indexes which are more beneficial for increasing the local thresholds. Similar to [61], we develop a heuristic based on the assumption that indexes that were beneficial in the recent past will also be beneficial in the near future.

Our activity schedule takes into account the current value of the partial threshold min_i as well as the number of candidates c_i an index has already delivered. The more candidates an index has delivered, the less priority should be given to the index. Each time a candidate is received from an index, we increment the counter c_i. If the candidate has already been examined (by another index), we give an extra penalty to the ith index by incrementing c_i once again. Our activity schedule then gives control to the local incremental algorithm with maximum min_i/c_i. Note that our heuristic prevents undesirable situations where one index keeps control for a very long period.

In case of TA-Index+, the distances to index entries influence not only the thresholds but also the activity schedule. The internal nodes help to exploit the R-trees dynamically in a more beneficial way, such that the number of examined candidates is reduced. This is an important advantage over the common TA [45].
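A minimal sketch of this heuristic in Java is given below; the field names and the tie-breaking for indexes that have not delivered any candidate yet are our own choices, not prescribed by the thesis.

final class ActivityScheduler {
    final double[] min;  // current partial thresholds min_i
    final int[] count;   // candidates c_i delivered by index i (plus penalties)

    ActivityScheduler(int n) { min = new double[n]; count = new int[n]; }

    // pick the index with maximum min_i / c_i (indexes never asked come first)
    int next() {
        int best = 0;
        double bestScore = -1;
        for (int i = 0; i < min.length; i++) {
            double score = (count[i] == 0) ? Double.MAX_VALUE : min[i] / count[i];
            if (score > bestScore) { bestScore = score; best = i; }
        }
        return best;
    }

    // bookkeeping after index i delivered a candidate at distance dist
    void report(int i, double dist, boolean alreadyExamined) {
        min[i] = dist;
        count[i]++;
        if (alreadyExamined) count[i]++; // extra penalty for a duplicate
    }
}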


3.2.6 Discussion

Each of the algorithms follows our framework in a different way. While Best-Fit is a modification of the classical incremental algorithm [68], TA-Index supports arbitrary queries due to the fact that the indexes efficiently compute an appropriate sorting of the data. The advantages of both approaches are elegantly combined in the TA-Index+ algorithm, which behaves similarly to TA-Index but additionally, as is the case for Best-Fit, exploits the internal entries of the index structure. The advantage of these algorithms is that they run on top of already existing multidimensional index structures, e.g., R-trees, that support an incremental processing of nearest neighbor queries.

Let us now discuss the extension of our framework in three important directions. First, in our framework the low-dimensional indexes currently maintain partial data points for processing k-NN queries. It is also possible to keep the entire data points in the leaves. This offers the advantage that the MDF is no longer necessary and, therefore, IO_cand = 0. However, the cost for processing the local incremental queries increases since, in the case of the R-tree, the capacity of the leaves is reduced by a factor of d. This also leads to higher trees and might increase their update cost, too. Moreover, the storage cost will also increase by a factor of d. Second, our algorithms can also take additional advantage when k-NN queries should be processed where k is indeed given in advance and where the user is not interested in receiving the first answers early. This goes along with the different kinds of techniques [107] known from classical algorithms.

Finally, our algorithms based on index-striping are more flexible and efficient when users are allowed to assign weights w_1, ..., w_d to the attributes when an incremental query is issued. This means that the user is interested in the nearest neighbor with respect to the weighted distance function:

L_2(l_1, ..., l_d) = \sqrt{w_1 l_1^2 + w_2 l_2^2 + \dots + w_d l_d^2}
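In Java, the weighted distance function could be sketched as follows (a straightforward transcription of the formula above):

final class WeightedDistance {
    // weighted L2 distance; the l_i = p[i] - q[i] are the coordinate differences
    static double weightedL2(double[] p, double[] q, double[] w) {
        double sum = 0;
        for (int i = 0; i < p.length; i++) {
            double diff = p[i] - q[i];
            sum += w[i] * diff * diff; // a weight of 0 projects the dimension away
        }
        return Math.sqrt(sum);
    }
}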

Then, our activity schedule can take advantage of these weights by processing those indexes more frequently that have received higher weights. An extreme case is that the user is only interested in nearest neighbor queries of a lower-dimensional subspace, i.e., the weights of the other attributes are 0 (also known as the projected nearest neighbor problem [66]). If this subspace even belongs to a single index, we are able to process the global query by using the results of only one index.

3.3 A Cost Model Based on Power Law

Accurate cost models are important to our algorithms for decision making. We are primarily interested in cost models for estimating the cost of k-NN queries. As surveyed in the next subsection, the existing models are not suitable for query optimization because they either suffer from serious inaccuracy or involve excessively complex integrals that are difficult to evaluate in practice.

3.3.1 Overview of Existing Cost Models

There are many different cost models that have been developed for R-trees, such as [120]. Analysis of k-NN queries aims at predicting: (1) the NN distance, i.e., the distance between the query and the kth nearest neighbor, and (2) the query cost in terms of the number of index nodes accessed or, equivalently, the number of nodes whose MBRs intersect the search region.

The earliest models for nearest neighbor search [51], [38] consider only single (k = 1) nearest neighbor retrieval, assuming the L_1 metric and N → ∞, where N is the total number of points in the data set. These first models were developed for uniformly and independently distributed data. Sproull [114] presents a formula suggesting that, in practice, N must be exponential in the dimensionality for the models of [51] and [38] to be accurate. When this condition is not satisfied, these models yield considerable errors due to the so-called boundary effect, which occurs if the distance from the query point to its kth nearest neighbor is comparable to the axis length of the data space.

The first work [4] that takes boundary effects into account also assumes the L_1 metric. Papadopoulos and Manolopoulos [103] provide lower and upper bounds on the nearest neighbor query performance of R-trees for the L_2 norm. Böhm [24] points out that these bounds become excessively loose when the dimensionality or k increases and, as a result, they are of limited use in practice. When independence of dimensions is assumed, the most accurate models are those of Berchtold et al. [18] and Böhm [24]; these models are valid only for uniform data distributions. Böhm [24] and other authors [100], [87] extend the solution to non-uniform data by computing the fractal dimension of the input data set. The usage of the fractal dimension has led to more accurate cost models, since multidimensional data tend to behave in a fractal manner. Our approach employs such a cost model for automatically setting important parameters.

3.3.2 The Proposed Model

We decided to use the fractal dimension as the basis for our cost model. The reason is twofold. First, there are many experimental studies, see [87], showing that multidimensional data sets obey a power law. Second, cost models for high-dimensional indexing which go beyond the uniformity assumption often employ a power law [24], [87]. Therefore, the fractal dimension seems to be a natural choice for designing a cost model. As with the previous models, we follow a two-step method: first we estimate the NN distance, and thereafter we estimate the number of nodes whose MBRs intersect the NN sphere, focusing on real data sets modeled with the fractal dimension.


The cost of our algorithms is expressed by a weighted sum of the I/O and CPU cost. The I/O cost is basically expressed in the number of page accesses: A_index refers to the number of accesses incurred by the local query processing, whereas A_cand is the number of page accesses required to read the candidates from the MDF. Note that A_cand is equal to the total number of candidates if no buffer is used. The CPU cost consists of two parts: first, the processing cost for the temporary data structures like PQ_Merge and HT_TID; second, and more crucial, the cost of the distance calculations.

Estimating the NN Distance

Let P be a set of points with finite cardinality N embedded in a unit hypercube of dimension d, and let D_2 be its fractal (correlation) dimension. We assume that the data does not follow a uniform and independent distribution and that the query points follow the same distribution as the data points. The average number of neighbors nb(r, D_2) of a point within a region of regular shape and radius r obeys the following power law:

nb(r, D_2) = (N - 1) \cdot Vol(r)^{D_2 / d}    (3.2)

where Vol(r) is the volume of a shape (e.g., cube, sphere) of radius r; see [87] for more details. The average Euclidean distance r of a data point to its kth nearest neighbor is then given by:

r = \frac{\Gamma(1 + \frac{d}{2})^{1/d}}{\sqrt{\pi}} \left( \frac{k}{N - 1} \right)^{1/D_2}    (3.3)
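A small sketch evaluating formula (3.3) is shown below; the exact computation of Γ(1 + d/2) for integer d via the recurrence Γ(x + 1) = x Γ(x) is our own choice of implementation.

final class PowerLawModel {

    // expected distance from a query point to its kth nearest neighbor, eq. (3.3)
    static double expectedNNDistance(int d, long n, int k, double d2) {
        double gamma = gammaOnePlusHalf(d); // Gamma(1 + d/2)
        return Math.pow(gamma, 1.0 / d) / Math.sqrt(Math.PI)
             * Math.pow((double) k / (n - 1), 1.0 / d2);
    }

    // exact Gamma(1 + d/2) for integer d, using Gamma(1) = 1 and Gamma(3/2) = sqrt(pi)/2
    static double gammaOnePlusHalf(int d) {
        double g = (d % 2 == 0) ? 1.0 : Math.sqrt(Math.PI) / 2.0;
        for (double x = d / 2.0; x > 1.0; x -= 1.0) g *= x; // recurrence step
        return g;
    }
}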

Now, we show that there is a strong relationship between the fractal dimension and the performance of nearest neighbor queries, especially in the case of multi-step algorithms. Since the ratio k/(N - 1) is always smaller than 1, an interpretation of the above formula reveals that a lower fractal dimension leads to a smaller average distance. A multi-step algorithm [109], where one index is used for indexing a subspace, is called optimal if exactly the points within this region are examined as candidates. The number of candidates that have to be examined by the optimal multi-step algorithm depends on the fractal dimension D'_2 of the subspace, i.e., the partial fractal dimension [81], with embedded dimensionality d', as shown in the following formula:

nb(r, D'_2) = (N - 1) \cdot Vol(r)^{D'_2 / d'} = (N - 1) \, \frac{(\sqrt{\pi}\, r)^{D'_2}}{\Gamma(1 + \frac{d'}{2})^{D'_2 / d'}}    (3.4)

The above formula shows that a higher fractal dimension produces fewer candidates for the same radius. Thus, the performance of an optimal multi-step algorithm depends on the fractal dimension of the subspace that is indexed. In our multi-step algorithm, where more than one index is used, a region r_i of each index i (1 ≤ i ≤ n), with r = \sqrt{\sum_{1 \le i \le n} r_i^2} (assuming the L_2 metric), is dynamically exploited.

The average number of candidates for an index on a subspace with finite cardinality N, embedded in a unit hypercube of dimension d_i and fractal dimension D_{2,i}, is given by:

nb(r_i, D_{2,i}) = (N - 1) \cdot Vol(r_i)^{D_{2,i} / d_i} = (N - 1) \, \frac{(\sqrt{\pi}\, r_i)^{D_{2,i}}}{\Gamma(1 + \frac{d_i}{2})^{D_{2,i} / d_i}}    (3.5)

The total number of candidates is:

nb(r) = \sum_{1 \le i \le n} nb(r_i, D_{2,i})    (3.6)

The above formula includes the duplicates that are eliminated during the processing of our algorithm, but the number of duplicates is of minor importance for our cost model. For the evaluation of this formula, some assumptions are necessary. The goal of our cost model (and of the dimension assignment algorithms) is to establish a good performance for all indexes. Therefore, we assume that d_i = d/n = d' and D_{2,i} = D'_2 for each index i. If these assumptions hold, it is expected that the indexes are equally exploited based on the activity scheduler, thus r_i = r/\sqrt{n} = r'.
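As a worked instance of these assumptions (the numbers are chosen for illustration only): for d = 16 dimensions split over n = 2 indexes,

\[
d' = \frac{16}{2} = 8, \qquad r' = \frac{r}{\sqrt{2}} \approx 0.707\, r.
\]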

Estimating the Number of Node Accesses

In the worst case, the I/O cost can be one page access per candidate. As observed in [28] and as confirmed by our experiments, this assumption is overly pessimistic. We therefore assume the I/O cost of the candidates to be the number of candidates divided by 2:

A_cand(r) = \frac{nb(r)}{2} = \frac{N - 1}{2} \cdot \frac{(\sqrt{\pi}\, r)^{D'_2}}{\Gamma(1 + \frac{d'}{2})^{D'_2 / d'}}    (3.7)

The I/O cost of processing a local k nearest neighbor query is equal to the cost of the equivalent range query of radius r_i. The expected number of page accesses for a range query can be determined by multiplying the access probability with the number of data pages N / C_eff, where C_eff is the effective page capacity. Based on the previous assumptions, the total cost of processing the local queries is:

A_index(r) = n \, \frac{N}{C_eff} \left( \sum_{0 \le j \le d'} \binom{d'}{j} \left( \left( 1 - \frac{1}{C_eff} \right) \sqrt[D'_2]{\frac{C_eff}{N - 1}} \right)^{j} \frac{\sqrt{\pi}^{\,d' - j}}{\Gamma(\frac{d' - j}{2} + 1)} \, (r')^{d' - j} \right)^{D'_2 / d'}    (3.8)


Number of Indexes

Our algorithms perform incremental nearest neighbor queries in a high-dimensional data space by exploiting multiple low-dimensional indexes. Therefore, we have to decide how many indexes should be created. The cost for processing a local nearest neighbor query within the lower-dimensional subspaces increases with an increasing partial fractal dimension. Our cost model therefore chooses the number of indexes for which the sum of all I/Os is minimized.
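The decision rule itself is simple, as the following sketch shows; totalCost is a hypothetical stand-in for evaluating A_cand(r) + A_index(r) from formulas (3.7) and (3.8) for a given number of indexes.

import java.util.function.IntToDoubleFunction;

final class IndexCountSelector {
    // evaluate the cost model for every feasible n and keep the minimum
    static int optimalNumberOfIndexes(int d, IntToDoubleFunction totalCost) {
        int bestN = 1;
        double bestCost = Double.MAX_VALUE;
        for (int n = 1; n <= d; n++) {   // n indexes, each over roughly d/n dimensions
            double cost = totalCost.applyAsDouble(n);
            if (cost < bestCost) { bestCost = cost; bestN = n; }
        }
        return bestN;
    }
}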

[Figure: node accesses (0 to 200,000) on the y-axis vs. number of indexes (1 to 7) on the x-axis.]

Fig. 3.2. Total Cost of k-NN query

Figure 3.2 shows the total cost over n indexes in a database consisting of 1,312,173 Fourier points in a 16-dimensional data space. The fractal dimension D_2 of the dataset is 10.56. We determine the parameter k by setting the expected average selectivity of a query to 0.01%. There is a clear minimum at n = 2, meaning that the cost model suggests that we should create 2 indexes. In our cost model we assume that the indexes maintain partial data points, whereas it is also possible to keep the entire data points in the leaves. This offers the advantage that the MDF is no longer necessary and, therefore, A_cand = 0. However, the cost for processing the local incremental queries increases since, in the case of the R-tree, the capacity of the leaves is reduced by a factor of d.

3.4 Dimension Assignment Algorithms

In this section, we sketch our algorithms for assigning d dimensions to n indexes, n < d. Based on the observation that subspaces with a low fractal dimension produce more candidates, we follow two basic goals. First, each of the n indexes should have a high fractal dimension. Second, the fractal dimension should be equal for all indexes, so that the indexes perform similarly. For cases where it is impossible to satisfy these two criteria, we also present an algorithm that produces a non-equal assignment of the dimensions. These goals conform to the assumptions of our cost model and enhance its applicability. Besides, in the experimental section (Section 3.6) we also examine the straightforward approach of assigning the attributes to the indexes in a round-robin fashion.

In the following, we present three algorithms for computing an assignment of the attributes. In general, an optimal solution of the dimension assignment problem is not feasible because of the large number of attributes. Instead, we employ simple greedy heuristics which are inexpensive to compute while still producing near-optimal assignments.

3.4.1 Linear Greedy Algorithm

The Linear Greedy Algorithm (LGA) starts with computing the partial fractal dimension of each attribute. Then, the next index receives the attribute with the highest partial fractal dimension that has not yet been assigned. Note that we circulate through the indexes to assign an equal number of attributes to each of the indexes. The code of this algorithm is given in Algorithm 3.

Algorithm 3 LGA: Linear Greedy Algorithm
1: Input: d dimensions, n indexes
2: Output: An assignment of the d attributes to the n indexes
3: Compute the partial fractal dimension of each attribute using the box-counting algorithm;
4: Sort the list of attributes according to their fractal dimension;
5: while the list of attributes is not empty do
6:   Let i be the next index in round-robin fashion;
7:   Assign the attribute with maximum partial dimension to index i;
8:   Remove this attribute from the list;
9: end while

The cost of the linear greedy algorithm is dominated by the cost of computing d partial fractal dimensions. We use the box-counting algorithm, which is linear in the size of the dataset [49]. The processing cost can be reduced by computing the fractal dimension of a sample of the dataset.
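For illustration, the following compact sketch estimates the correlation dimension D_2 of a subspace by box counting; it assumes data normalized to the unit cube and fits the slope of log S(r) against log r, where S(r) is the sum of squared box occupancies. The thesis relies on the linear-time algorithm of [49]; this sketch favors brevity over efficiency.

import java.util.HashMap;
import java.util.Map;

final class BoxCounting {
    // correlation dimension over the selected dimensions dims, for several grid sizes
    static double correlationDimension(double[][] data, int[] dims, int[] gridSizes) {
        double[] logR = new double[gridSizes.length];
        double[] logS = new double[gridSizes.length];
        for (int g = 0; g < gridSizes.length; g++) {
            Map<String, Integer> boxes = new HashMap<>();
            for (double[] p : data) {                 // assign each point to a grid cell
                StringBuilder key = new StringBuilder();
                for (int dim : dims)
                    key.append((int) (p[dim] * gridSizes[g])).append(':');
                boxes.merge(key.toString(), 1, Integer::sum);
            }
            double s = 0;
            for (int c : boxes.values()) s += (double) c * c; // sum of squared counts
            logR[g] = Math.log(1.0 / gridSizes[g]);           // cell side length
            logS[g] = Math.log(s);
        }
        return slope(logR, logS); // least-squares slope approximates D_2
    }

    static double slope(double[] x, double[] y) {
        double mx = 0, my = 0;
        for (int i = 0; i < x.length; i++) { mx += x[i]; my += y[i]; }
        mx /= x.length; my /= y.length;
        double num = 0, den = 0;
        for (int i = 0; i < x.length; i++) {
            num += (x[i] - mx) * (y[i] - my);
            den += (x[i] - mx) * (x[i] - mx);
        }
        return num / den;
    }
}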

3.4.2 Quadratic Greedy Algorithm

The linear greedy algorithm is an inexpensive method for computing an approximate solution. The problem, however, is that there might be a correlation between the partial fractal dimensions of the attributes. Consider two attributes with a high fractal dimension: the fractal dimension of the data space spanned by the two attributes might be lower than the sum of the single partial dimensions. The Quadratic Greedy Algorithm (QGA) overcomes this problem, but, as anticipated from the name, its computation cost is quadratic in the number of dimensions.

Algorithm 4 QGA: Quadratic Greedy Algorithm
1:  Input: d dimensions, n indexes
2:  Output: An assignment of the d attributes to the n indexes
3:  Compute the partial fractal dimension of each attribute using the box-counting algorithm;
4:  Sort the list of attributes according to their fractal dimension;
5:  for each index i do
6:    Assign the attribute with the maximum partial dimension to index i;
7:    Remove this attribute from the list;
8:  end for
9:  while the list of attributes is not empty do
10:   Let i be the next index in round-robin fashion;
11:   for each remaining attribute A do
12:     Calculate the partial fractal dimension of the attributes of index i together with A;
13:   end for
14:   Assign the attribute with the highest increase of the partial dimension to index i;
15:   Remove this attribute from the list;
16: end while

The quadratic greedy algorithm is presented in Algorithm 4. Similar to the linear algorithm, we start with assigning the n attributes with the highest partial fractal dimension to the indexes, where every index receives exactly one attribute. Thereafter, we iterate over the indexes in round-robin order and assign a new attribute to each of the indexes. The index receives the attribute that gives the largest increase in the partial fractal dimension of the index. Therefore, the algorithm calls the box-counting algorithm O(d^2) times.

3.4.3 Non-Equal Greedy Algorithm

This algorithm, termed the Non-equal Greedy Algorithm (NGA), starts with assigning the n attributes with the highest partial fractal dimension [81] to the indexes, where every index receives exactly one attribute. In the following iterations, an attribute that has not yet been assigned to an index is assigned to the index with the minimum fractal dimension. The index receives the attribute that maximizes the fractal dimension of the index. Notice that the indexes do not necessarily all receive the same number of attributes.

Our algorithm is outlined in Algorithm 5. The function D_2(S) calculates the fractal dimension of a subspace S. The parameter A_i represents the ith attribute, L refers to the set of unassigned attributes, and da_i is the set of attributes assigned to the ith index.


Algorithm 5 NGA: Non-equal Greedy Algorithm
1:  Input: L = {A_1, ..., A_d}
2:  Output: An assignment of attributes to indexes
3:  for i = 1, 2, ..., n do
4:    A_max ← argmax_{A ∈ L} {D_2(A)};
5:    da_i ← da_i ∪ {A_max};
6:    L ← L − {A_max};
7:  end for
8:  while L is not empty do
9:    dim ← argmin_{1 ≤ j ≤ n} {D_2(da_j)};
10:   A_max ← argmax_{A ∈ L} {D_2(A ∪ da_dim)};
11:   da_dim ← da_dim ∪ {A_max};
12:   L ← L − {A_max};
13: end while

The algorithm calls the box-counting algorithm [14], which calculates the fractal dimension in time linear in the size of the dataset, O(d^2) times.

3.5 NN Search Using External Priority Queues

The cost of priority queue operations plays a role in the performance of ourincremental nearest neighbor algorithms. The larger the queue size gets, themore costly each operation becomes. Also, if the queue gets too large to fitin memory, its contents must be stored in a disk-based structure instead of inmemory, making each operation even more costly. An example of the worstcase of the queue size for our R-tree incremental nearest neighbor algorithmsarises when all leaf nodes are within distance dist from the query object q,while all data objects are farther away from q than dist.

In this section, we study the problem of maintaining a priority queue within our framework, complying with the two-level memory architecture, i.e., a fast internal memory and a slow external memory. We assume that the I/O subsystem is responsible for transferring data between internal and external memory in blocks of a fixed size. The processor can only access data stored in internal memory, and the capacity of the internal memory is assumed to be bounded, so it might be necessary to store part of the data in external memory.

3.5.1 Overview of External Priority Queues

In cases where the priority queue exceeds the size of the available memory, it must be stored in whole or in part in a disk-resident structure. One possibility is to use a B-tree structure to store the entire contents of the priority queue. With proper buffer management, we should be able to arrange that the B-tree nodes that store elements with smaller distances (which will get dequeued early) are kept in memory. However, we believe that when the priority queue actually fits in memory, using B-trees will be considerably slower than using fast heap-based approaches, since the B-tree must expend more work on maintaining the queue elements in fully sorted order. In contrast, heap methods impose a much looser structure on the elements. A hybrid scheme for storing the priority queue, where a portion of the priority queue is kept in memory and a portion is kept on disk, therefore seems more appropriate.

There exist several approaches that deal with the problem of implementing priority queues in a main-memory environment, such as chained lists, binary heaps, and binomial heaps. In the disk model, however, these approaches no longer work, unless virtual memory is assumed. Existing approaches for the disk model include external radix heaps, external array heaps, and sequence heaps.

Sequence heaps are very effective for cached memory and are also well suited for external disk. The structure of sequence heaps is similar to that of external array heaps (see Figure 3.3). Sequence heaps can be further optimized by storing the unordered insertion sequence in an internal priority queue: the smallest object is returned from this queue and the next object is inserted into it. The structure of this optimized variant is a mix of array heaps and sequence heaps (Figure 3.4).

Fig. 3.3. Sequence Heaps.
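To convey the flavor of such structures, the following toy sketch implements a two-level priority queue in the spirit of sequence heaps: an in-memory insertion heap is flushed into sorted runs (standing in for disk-resident groups), and deleteMin merges the insertion heap with the runs. Real sequence heaps organize the runs into groups of k slots with group buffers; none of that machinery is reproduced here.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.PriorityQueue;

final class TwoLevelPQ {
    private final int m;                                  // insertion buffer capacity
    private final PriorityQueue<Double> insertBuffer = new PriorityQueue<>();
    private final List<Deque<Double>> runs = new ArrayList<>(); // sorted "groups"

    TwoLevelPQ(int insertBufferCapacity) { this.m = insertBufferCapacity; }

    void push(double key) {
        if (insertBuffer.size() == m) {                   // flush one sorted run
            Deque<Double> run = new ArrayDeque<>();
            while (!insertBuffer.isEmpty()) run.addLast(insertBuffer.poll());
            runs.add(run);
        }
        insertBuffer.add(key);
    }

    Double popMin() {                                     // merge step of deleteMin
        Deque<Double> bestRun = null;
        for (Deque<Double> run : runs)
            if (!run.isEmpty() && (bestRun == null || run.peekFirst() < bestRun.peekFirst()))
                bestRun = run;
        if (bestRun == null || (!insertBuffer.isEmpty() && insertBuffer.peek() <= bestRun.peekFirst()))
            return insertBuffer.poll();                   // may be null if everything is empty
        return bestRun.pollFirst();
    }
}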

3.5.2 External Priority Queues within Index-Striping

In this section, we study the problem of using external priority queues within our framework. Two issues regarding sequence heaps need a careful discussion. The first concerns a modified sequence heap where all group buffers are kept in memory. With this modification, the sequence heap becomes very efficient. Unfortunately, in order to determine the parameter k (i.e., the number of slots in a group), the number of groups has to be determined beforehand. Then it is possible to determine the parameter k needed for the sequence heap so that all needed components are kept in memory.


Fig. 3.4. Optimizing Sequence Heaps.

However, it is not clear how many objects will be inserted into the priority queue. We simply assume that the maximum number of enqueue operations equals the total number of index entries and data entries of the index. This number is henceforth called the maximum limit, and the sequence heap can be configured based on this parameter. As our experiments indicated, this number is sufficient, and in practical scenarios the number of inserted objects is much lower than this maximum limit.

The second issue that needs a careful discussion concerns the following problem: only the first blocks of all group buffers are stored in memory. Then the determination of the parameter k does not depend on the number of groups. Unfortunately, we have to account for additional I/Os for dequeue operations, simply because every object in the sequence heap has to be written to external disk as well. The sequence heap is therefore modified so that its structure adapts according to the number of available objects (i.e., the number of groups). This has the following advantages: (1) the memory usage is optimized, and (2) the additional I/Os are significantly reduced. A sequence heap which dynamically adapts its structure is henceforth called 'self-adaptive'.

Self-adaptivity

The main idea behind self-adaptivity is that the storage location of a group buffer changes dynamically. Because the two buffers (insert and group buffer) are merged in external memory each time before the reorganization of the insert buffer, the corresponding group buffer always remains in memory. The remaining group buffers are stored in memory whenever this is needed.

The example in Figure 3.5 shows how the sequence heap structure dynamically adapts. In the beginning, in Figure 3.5(a), the three buffers (group, insert and merge buffer) are initialized in memory according to the parameter k. The gray area in this figure corresponds to the free space. When more objects are inserted, group 2 is created, and group buffer 2 also has to acquire space in memory. At this stage it is assumed that group buffer 2 fits in memory, as shown in Figure 3.5(b). The main memory is now full, and as a consequence there is no free space left for group buffer 3.


Fig. 3.5. "Self-Adaptive"

When group 3 needs space, the memory has to make room for group buffer 3; this is done by moving the last half of the blocks (from group buffer 2) to external disk. Group buffer 3 stores its first half of blocks in memory, as illustrated in Figure 3.5(c). Because half of the blocks of group buffers 2 and 3 now reside on external disk, we have to pay additional I/Os for dequeue operations. However, these additional I/Os correspond only to half of the objects of groups 2 and 3, simply because only half of the objects of those groups have to be inserted into the external part of the group buffer when the group buffer is retrieved. Figure 3.5(d) shows the partitioning of the memory in case of four groups. Group buffers 2 and 3 have to store more blocks on external disk in order to make room in memory for group buffer 4. If we further assume that the maximum limit is 4 groups, then the resulting partitioning is illustrated in Figure 3.5(e).

Multiple External Priority Queues

Our TA-Index and TA-Index+ algorithms process multiple queries on selected indexes, so multiple external priority queues are required that access the memory at the same time. For this reason, every external priority queue is assigned a portion of the memory. A naive approach is to split the memory evenly among the sequence heaps, so that each one receives the same amount of memory. This way, every priority queue receives M/n of the memory, as demonstrated in Figure 3.6, assuming n priority queues request a total memory of M at the same time.

However, such an approach has essential drawbacks. In the naive approach, every sequence heap has its own merge buffer. Besides, all sequence heaps are required by our algorithms; this means that the sequence heaps are not running concurrently, but they are located in memory at the same time. Because operations on a single sequence heap are processed one at a time, it is possible to use a shared merge buffer for all sequence heaps. In order to determine the size of this shared merge buffer, a simple strategy is to pick the size of the biggest merge buffer of all available sequence heaps.


Fig. 3.6. The Naive Approach.

Our enhanced approach is to use a group buffer manager for the dynamic management of the group buffers of all sequence heaps. The group buffer manager initializes group buffer 1 of each sequence heap. As was the case for self-adaptivity, all group buffers 1 are kept in memory. In case a sequence heap requires a new group buffer, the group buffer manager has to reserve this space. However, the following problem may arise: there might be no room in memory for the new group buffer. We may solve this problem by simply moving some blocks from a group buffer in memory to external disk (as was the case for self-adaptivity). Because the manager now controls the group buffers of all sequence heaps, it is not restricted to choosing only the group buffers of that particular sequence heap as candidates; it may also choose the group buffers of the other sequence heaps. This results in more candidates than in the case of self-adaptivity with one sequence heap.

Fig. 3.7. The New Approach.

The example of Figure 3.7 illustrates the idea of the manager. We assume that we have created two indexes and that three sequence heaps are running within main memory. Two sequence heaps are responsible for the two indexes (i.e., the local PQ_i) and one serves as PQ_Merge. For simplicity we denote these heaps as PQ1, PQ2 and PQ3. The group buffer j of PQi is denoted as Gb_ij. All sequence heaps use the shared merge buffer. At the initialization step, the group buffer manager assigns to every sequence heap the corresponding group buffer 1 (Figure 3.7(a)). The gray area corresponds to the free space in memory remaining after the initialization process. Figure 3.7(b) shows the partitioning of the free space among the group buffers in a subsequent step.


For the new group buffer 3 of PQ2, the free space is taken from Gb_22, as shown in Figure 3.7(c); this is done by moving half of its blocks to external disk. Because the group buffer manager coordinates the group buffers of all three PQs, the free space in memory for Gb_23 can be provided either from Gb_22 or from Gb_12. Therefore, Gb_23 keeps more blocks in memory, as shown in Figure 3.7(d). Although additional I/Os are required for PQ1, this is balanced by the number of objects in group 3 of PQ2: the saving in I/Os from moving fewer blocks of Gb_23 to external disk is greater than the additional I/O cost for PQ1.

3.5.3 Discussion

In this section we have shown how our incremental NN algorithms can benefitfrom the usage of external priority queues. The choice of a particular priorityqueue depends on the application at hand. Here we considered sequence heapsand we have examined various ways to tune the priority queue, so that nearestneighbor queries are executed more efficiently.

Clearly, there are various open topics, such as the investigation of different strategies for sharing the memory among the priority queues. One way to do this is by using a heuristic which predicts how urgently an index needs memory; the dimensionality of the index or its fractal dimension is an excellent indicator. A more advanced approach concerns dynamic memory allocation, adapting the memory to the specific needs of the indexes during query processing. Another direction, which we consider as future work, concerns the required scheduler modification in order to examine in which way the memory allocation influences our nearest neighbor algorithms.

3.6 Experimental Validation

For the experimental evaluation of our approach, indexes were created on three data sets from real-world applications. All data sets contain features of 68,040 photo images extracted from the Corel Draw database and have already been used very often for experimental performance studies of nearest neighbor queries. The first dataset contains 16-dimensional points of image features, while the second dataset contains 32-dimensional points, where each object contains 32 numbers representing the color histogram of an image. The third dataset is a 64-dimensional color histogram dataset. In all our experiments, we examined incremental k-NN queries where the distribution of the query points follows the distribution of the data points. Moreover, we report only the results obtained by using the Euclidean distance L_2 as the distance function; similar results were obtained with other distance functions. All our measurements were averaged over 100 queries.

We implemented our index-striping framework in Java using the XXL library [42]. In all the experiments we use an R-tree [62] as the embedded index structure for our framework. We used a publicly available implementation of the R-tree that is similar to the R*-tree [13], but without using re-insertion [42]. We set the page size to 8K in all our experiments, and each dimension was represented by a double-precision real number. Note again that our approach to index-striping is not limited to R-trees. Our absolute numbers may change dramatically when we use a different kind of index structure or even only a different R-tree implementation. It is, however, beyond the scope of this work to examine index-striping for different kinds of index structures.

In the following subsections, we first examine the important design decisions of our algorithms. In particular, we start with the evaluation of our cost model (Subsection 3.6.1). After that, we study the greedy algorithms for dimension assignment (Subsection 3.6.2). In Subsection 3.6.3, we examine the performance of our k-NN algorithms (Best-Fit, TA-Index and TA-Index+) by comparing them for the processing of nearest neighbor queries. Finally, in Subsection 3.6.4 we present results where we compare index-striping to methods that employ a single index on either all attributes or a specific subset of the attributes.

3.6.1 Evaluation of the Cost Model

In order to illustrate the accuracy of our cost model, we show in Figure 3.8the estimated page accesses and the actual page accesses measured during theexecution of 10-NN queries.

[Figure: page accesses (0 to 2,000) vs. dimensionality (16 and 32); bars: A_cand and A_index (estimated) next to IO_cand and IO_index (measured).]

Fig. 3.8. Accuracy of our Cost Model

The plot shows the I/O cost for both data sets, where the fractal dimensions of the 16-dimensional and 32-dimensional datasets are 7.5 and 7.1, respectively. Note that the relative error is less than 20% in our experiments. Furthermore, TA-Index+ produces fewer page accesses, due to the usage of an activity schedule which is not considered by the cost model.

3.6.2 Examination of Attribute Assignment

In this subsection, we study the performance of a multidimensional index that is defined on a subspace of the dataset. We are particularly interested in the impact of the partial fractal dimensions of the subspaces on the overall performance. For this, we report the number of retrieved candidates and the number of page accesses (I/Os) of the local incremental algorithms.

Effect on the Fractal Dimension

The first experiment compares the number of candidates produced by the nearest neighbor algorithm when only one index is used on our 16-dimensional dataset. The index was created on different 5-dimensional subspaces varying in the partial fractal dimension. Figure 3.9 shows the average number of candidates (filter step) over 100 k-NN queries, k = 10, for different partial fractal dimensions.

[Figure: number of candidates (0 to 12,000) vs. partial fractal dimension (1.366, 1.549, 1.614, 1.976, 2.002).]

Fig. 3.9. Average number of candidates for 10-NN queries (5 attributes selected from 16)

As claimed in Section 3.3, a subspace with a higher partial fractal dimension produces fewer candidates. The experiments substantiate the general observation that the number of candidates decreases with an increasing partial fractal dimension of the subspace.

In the next experiment, we examine the performance of our nearest neighbor algorithms for different attribute assignments that vary in the sum of the partial fractal dimensions of the indexes. The purpose of the experiment is to illustrate the impact of the sum of the partial fractal dimensions on our nearest neighbor algorithms based on index-striping.

In Figure 3.10, we report the I/Os for different attribute assignments when performing 10-NN queries on the 16-dimensional dataset. Note that the lower part of each bar refers to the I/O required for reading the candidates, whereas the upper part represents the I/O of the local incremental algorithms. The results confirm that our strategy (maximizing the sum of the partial fractal dimensions) indeed leads to the best performance. As predicted, however, the I/Os of the local incremental algorithms increase for larger fractal dimensions.


[Figure: I/Os (0 to 450), split into filter I/O and refinement I/O, vs. sum of partial fractal dimensions (4.387, 4.788, 5.387, 6.021).]

Fig. 3.10. Average number of I/Os for 10-NN queries for different attribute assignments

Effect on the Assignment Strategy

The plot in Figure 3.11 contains three curves reporting the number of candidates for a specific assignment strategy as a function of k, where k denotes the desired number of neighbors. More particularly, we compared the three strategies that result in an equal number of attributes per index: the naive round-robin assignment strategy, the linear greedy algorithm, and the quadratic greedy algorithm.

[Figure: number of candidates (0 to 2,500) vs. query parameter k (0 to 200); curves: Round-Robin, LGA, QGA.]

Fig. 3.11. Average number of candidates of k-NN queries as a function of k (different assignment strategies)

These results demonstrate that the round-robin strategy leads to substantially more candidates than our greedy strategies. Moreover, there is only a little difference between the two greedy strategies, where the quadratic algorithm still produces a slightly smaller number of candidates. As expected, the number of candidates increases with k. However, increasing the parameter k affects the performance of the round-robin assignment the most, while the number of candidates grows slowest for the quadratic attribute assignment.


Our experiments confirm that, using a multi-step nearest neighbor algorithm, an index on a subspace of the dataset with a higher fractal dimension produces fewer candidates.

3.6.3 Comparison of our k-NN algorithms

As in the first set of experiments, the results reported in this subsection also refer to the 16-dimensional dataset with 68,040 points. We compared our different algorithms, namely Best-Fit, TA-Index and TA-Index+. We also examined the impact of the activity schedule on the two latter algorithms. All three algorithms use the same set of indexes and the same assignment of attributes to these indexes. The assignment was generated by using the linear greedy algorithm. Overall, there are three R-trees of dimensionality 6, 5 and 5, respectively. Again, we performed k-NN queries for k = 10, 20, 50 and 100.

[Figure, two panels: (a) number of candidates (0 to 2,000) vs. query parameter k (0 to 100) for Best-Fit, TA-Index and TA-Index+; (b) number of I/Os (0 to 300) vs. k for TA-Index and TA-Index+.]

Fig. 3.12. Average number of candidates and I/Os as a function of k.

The left diagram in Figure 3.12 depicts the number of candidates generated by the three algorithms. Best-Fit performs poorly, whereas TA-Index and TA-Index+ produce a significantly lower number of candidates. Recall that for each candidate it is required to read the entire point from disk and to compute the (expensive) global distance function. Therefore, the number of candidates is also an excellent indicator of the CPU cost.

The right diagram in Figure 3.12 compares the performance of TA-Index and TA-Index+. Recall that TA-Index+ exploits the information of the internal nodes of the R-tree to speed up the enlargement of the partial thresholds. Moreover, TA-Index+ offers more stop points where the activity scheduler might decide to give control to a different index. In general, TA-Index+ uses this information for a better navigation in the indexes, where the expansion of unnecessary leaf pages is avoided. This results in fewer page accesses in comparison to running TA-Index.


Effect on Index Scheduling

In the next experiment, we consider different strategies of the activity scheduler. In Figure 3.13, we report the performance of a round-robin strategy and our fine-tuned strategy. The graphs show that our strategy is consistently superior.

[Figure: number of candidates (0 to 1,000) vs. query parameter k (0 to 100); curves: Round Robin, Activity Scheduler.]

Fig. 3.13. Average number of candidates for k-NN queries as a function of k (different scheduling strategies)

3.6.4 Comparative Study of Index-Striping and other Techniques

In this experiment, we compare the following four techniques: TA-Index+, Original Space Indexing (OSI), Global Dimensionality Reduction (GDR) and Linear Scan. We examined the I/O cost of k-NN queries for the 32-dimensional dataset. For TA-Index+, we created two 16-dimensional R-trees, as required by our cost model. Each of the corresponding subspaces has a fractal dimension of almost 3.75, resulting in well-performing R-trees. For the GDR method, we used the optimal multi-step algorithm on a 16-dimensional R-tree. The results of our experiment are plotted in Figure 3.14.

[Figure: page accesses (0 to 4,500) vs. query parameter k (2 to 10); curves: TA-Index+, GDR, OSI, Linear Scan.]

Fig. 3.14. Comparative Study


The plot contains four curves reporting the I/O as a function of k, where k denotes the desired number of neighbors. As illustrated, TA-Index+ consistently outperforms the other techniques in our experiment.

In the next experiment we use a 64-dimensional dataset which contains color histogram features extracted from the Color image collection. The dataset has 22,472 elements. For this experiment we divide the original space into four vertical subspaces and use four indexes. One index of lower dimensionality and one index over the whole data space are used for comparison. More specifically, we compare our algorithm against the multi-step algorithm of [109]. This algorithm has been successfully applied to image databases and focuses on candidate elimination, since complex distance functions are often used in image databases. We ran 100 k-NN queries and varied k.

Fig. 3.15. Average number of candidates (a, GDR vs. TA-Index+) and I/Os (b, OSI vs. TA-Index+) as a function of k.

In the left part of Figure 3.15 we compare our TA-index+ algorithm withthe optimal multi-step algorithm using one 16-dimensional R-tree and showthe number of candidates produced. As we can see our approach produceless candidates and so less global distance calculation are needed. On theother hand, the multi-step algorithm could use an expensive transformationand so to improve his performance. The right part of Figure 3.15 shows ourinternal page access cost (I/O cost) in comparison with a 64-dimensional R-tree. The performance of our approach using multiple indexes is better thanusing a single index. Additional disk accesses are needed for retrieving thedata objects. Even in the case where one I/O is needed for each candidate,our approach outperforms the one index solution. It is possible to have evenless random accesses by using a buffer.

Examination of External Priority Queues

In this experiment we examine the performance of our framework when using external priority queues. For this purpose we used the uniform dataset, which has a cardinality of 15,000 points, and we varied the dimensionality from 9 to 17. All external priority queues together are assigned 100B of memory storage.


We compare index-striping against the Original Space Indexing (OSI) method and we measure the average of 100 uniformly distributed 5-NN queries. In order to examine the scalability of the index-striping approach, we used index-striping with 2 and 3 indexes.

Fig. 3.16. External priority queues: I/Os as a function of the dimensionality for n = 1, 2 and 3 indexes.

The parameter k required in the sequence heap for the OSI method was set to 25. Every sequence heap used by index-striping was dimensioned as proposed in Section 5.2. The results are plotted in Figure 3.16. Our framework using two and three indexes consistently outperforms the OSI method; moreover, when d is larger than 13, the approach using three indexes is superior to that using two, and in higher-dimensional spaces the difference increases. The conclusion drawn here is that the higher the dimensionality of the data space, the more index-striping benefits from using multiple indexes. Index-striping therefore seems to be a candidate indexing technique for high-dimensional similarity search that is not severely affected by the curse of dimensionality.

3.7 Conclusion

In this chapter, we revisited the problem of employing R-trees (or other multidimensional index structures) for supporting k-NN queries in an iterative fashion, and we presented a new approach to indexing multidimensional data that is particularly suitable for efficient incremental processing of nearest neighbor queries. The basic idea is to vertically split the data space into multiple low-dimensional data spaces. The data from each of these lower-dimensional subspaces is organized by a standard multi-dimensional index structure, which performs well and does not suffer from the "dimensionality curse".

Our approach is independent of the specific index structure. For a given query point, we are therefore able to dynamically exploit those R-trees which give the most benefit to the total query result. Moreover, since every R-tree is responsible for only a small number of dimensions, their performance is not affected by the curse of dimensionality. It is also worth mentioning that our algorithm is built on top of existing R-tree implementations and can therefore be immediately applied in commercial DBMSs that offer R-trees and iterative nearest neighbor queries.


4

MiD: a Dimension Adaptive Indexing Method for Efficient Similarity Search

In this chapter, we propose a multidimensional extension of the original iDistance method [130], [77], which we term Multidimensional iDistance (MiD), to support fast k-NN search in high-dimensional spaces. Three main steps are performed to build MiD. In agreement with iDistance, data points are first partitioned into clusters and, secondly, a reference point is determined for every cluster. However, the third step substantially differs from iDistance, as a data object is mapped to an m-dimensional distance vector where m > 1 generally holds. The m dimensions are generated by splitting the original data space into m subspaces and computing the partial distance between the object and the reference point for every subspace. The resulting m-dimensional points can be indexed by an arbitrary point access method like an R-tree. Our crucial parameter m is derived from a cost model that is based on a power law. Compared with the original iDistance technique, both theoretical analysis and experimental studies show that MiD is superior to iDistance for nearest neighbor queries.

The rest of the chapter is organized as follows. In the next section we discuss previous work on similarity search and sketch our approach. Section 4.2 reviews the iDistance technique. Section 4.3 presents the proposed indexing scheme, while Section 4.4 deals with the corresponding query processing techniques. Thereafter, in Section 4.5, we provide a cost model relying on a power law and we propose a formula for computing the appropriate parameter m. Section 4.6 contains an extensive experimental evaluation that demonstrates the accuracy of the distance splitting and the efficiency of the proposed index structure. Finally, Section 4.7 concludes the chapter.

4.1 Introduction

Similarity search is a crucial task in many multimedia and data mining applications, and extensive studies have been performed in this area. Usually, the objects are mapped to multi-dimensional points, and similarity search is modeled as nearest neighbor search in a multi-dimensional space. In the last decade, many structures and algorithms have been proposed aiming at accelerating the processing of k nearest neighbor queries. Early methods are based on R-tree-like structures such as the SS-tree [125] and the X-tree [21]. However, R-tree-like structures all suffer from the "dimensionality curse": their performance deteriorates dramatically as the dimensionality becomes high. [123] has shown this phenomenon both analytically and experimentally. Therefore, the authors of [123] proposed an algorithm based on compression, called the vector approximation file (VA-file), to accelerate the sequential scan.

While the works above mainly emphasize the efficiency of k-NN search, other works look at k-NN from the aspect of effectiveness. Beyer et al. [22], [111] show that at very high dimensionality, the distances to the nearest and to the furthest point in a data set are almost the same. At the same time, however, they also show that points that are generated from distinct clusters do not obey such rules. In [77], iDistance, which is based on this observation, was proposed to support k-NN search for high-dimensional data. The design of iDistance is motivated by the following observations. First, the (dis-)similarity between data points can be derived with reference to a chosen reference point. Second, data points can be ordered based on their distances to a reference point, and points which are near in high-dimensional space are expected to have similar distances to the reference point. Third, a distance is essentially a single value. Hence, high-dimensional data can be represented in a single-dimensional space, making use of existing single-dimensional indexes such as the B+-tree.

In this chapter, we propose a dimension-adaptive index method, called MiD, that efficiently supports both range and k-NN queries on datasets with different dimensionalities and data distributions. Our scheme is based on the iDistance technique [77], which is primarily designed for k-NN queries. While the iDistance technique has been shown to be much better than previously proposed approaches such as the VA-file [123], it has the following inherent deficiency: because of the one-dimensional distance transformation, iDistance is lossy in nature. It is possible that points with the same iDistance values are actually not close to one another; some may be closer to q, while others are far from it. In these cases, a sequential scan over the data set may be more effective.

The basic structure of MiD consists of separate multidimensional indexes for each cluster, connected to a single root node. In order to build MiD, data points are partitioned into clusters and a reference point is determined for every cluster. However, the choice of the transformation substantially differs from iDistance, as a data object is mapped to an m-dimensional distance vector where m > 1 generally holds. The m dimensions are generated by splitting the original data space into m subspaces and computing the partial distance between the object and the reference point for every subspace.


4.2 Overview of the iDistance Technique

The iDistance technique partitions the data space and defines a reference point for each partition. In a second step, the distance of each data point to the reference point of its partition is indexed. Since this distance is a simple scalar, with a small mapping effort to keep partitions distinct, a classical B+-tree is used to index it.

4.2.1 The Data Structure

In iDistance, high-dimensional points are transformed into a single-dimensional space. This is done using a three-step algorithm. In the first step, the high-dimensional data space is split into a set of partitions. In the second step, a reference point is identified for each partition. Without loss of generality, suppose we have n partitions (clusters) C1, C2, ..., Cn; their corresponding reference points P1, P2, ..., Pn are selected either based on a pre-defined space partitioning strategy or a data partitioning strategy [77]. In the third step, every data object is assigned a one-dimensional iDistance key according to its distance to the reference object of its cluster. With a constant c that separates individual clusters, the iDistance key for an object o ∈ Ci is:

iDist(o) = dist(Pi, o) + i ∗ c

Provided that c is large enough, all objects in cluster Ci are mapped to the interval [i ∗ c, (i + 1) ∗ c]. Figure 4.1 visualizes an example of the mapping.
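For concreteness, the key computation can be sketched in a few lines of Python; the function name and the choice of c are illustrative, not part of the original iDistance implementation, which stores the resulting keys in a B+-tree:

```python
import math

def idistance_key(o, cluster_id, reference_points, c):
    # iDist(o) = dist(P_i, o) + i * c maps every object of cluster C_i
    # into the disjoint key interval [i * c, (i + 1) * c]
    p_i = reference_points[cluster_id]
    return math.dist(p_i, o) + cluster_id * c

# Two clusters in [0, 1]^2; c = 10 safely exceeds any in-cluster distance
refs = [(0.2, 0.2), (0.8, 0.7)]
print(idistance_key((0.25, 0.3), 0, refs, 10.0))  # key in [0, 10)
print(idistance_key((0.90, 0.6), 1, refs, 10.0))  # key in [10, 20)
```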

In iDistance, two data structures are employed: (1) a B+-tree is used to index the transformed points to facilitate speedy retrieval; the B+-tree is chosen because of its wide availability in commercial systems. (2) An array is used to store the n reference points and their respective nearest and furthest radii that define the data space.

Fig. 4.1. The principles of iDistance.


With an appropriate choice of the partition scheme, iDistance is one of the most efficient access techniques with respect to k-NN search. It is true that the power of pruning methods deteriorates with increasing dimensionality and parameter k, but the effect is less dramatic for iDistance. In [77], the authors show the improvement factor of iDistance with increasing data space dimensionality, compared with other techniques.

4.2.2 Query Processing

The iDistance k-NN search algorithm searches the index from the query point outwards; for each partition that intersects with the query sphere, a range query is issued. If the algorithm has found k elements that are closer than r to q at the end of the search, it terminates. Otherwise, it extends the search radius by δr, and the search continues in the unexplored regions of the partitions that intersect with the query sphere. The process is repeated until the stopping condition is satisfied.

The algorithm for k nearest neighbor search in the iDistance technique is thus based on repeated range queries with growing radius r. An accessed object o is added to the k-NN answer set S if either |S| < k or dist(q, o) < dist(q, farthest(S, q)), where farthest(S, q) is the object from S with the greatest distance to q. The sequence of range queries terminates and the k-NN search is completed when dist(farthest(S, q), q) < r and |S| = k. For each range iteration, the k-NN search algorithm stores information about the already searched space (taking advantage of the B+-tree properties) in order not to search any space redundantly.

The clusters that can possibly contain objects from the query range are all those clusters that satisfy dist(q, Pi) − r ≤ max_dist_i, where max_dist_i is the maximum distance between Pi and the objects in cluster Ci. Figure 4.1 shows the clusters influenced by the query (C0, C1) and indicates more precisely the space areas to be searched. Such an area within cluster Ci corresponds to the iDistance interval:

[dist(Pi, q) + i ∗ c − r, dist(Pi, q) + i ∗ c + r]

Thus, several iDistance intervals are determined; the distance dist(q, o) is evaluated for all objects o from these intervals, and the query answer set is created as {o | dist(q, o) ≤ r}.
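As a sketch of this step (the helper name and the representation of a cluster as a (P_i, max_dist_i) pair are ours, following the reference-point array described above), the affected clusters and their key intervals can be computed as follows:

```python
import math

def affected_intervals(q, r, clusters, c):
    """Key intervals to scan on the B+-tree for a range query range(q, r).
    `clusters` is a list of (P_i, max_dist_i) pairs, one per cluster C_i."""
    intervals = []
    for i, (p_i, max_dist_i) in enumerate(clusters):
        d = math.dist(p_i, q)
        if d - r <= max_dist_i:            # C_i intersects the query sphere
            intervals.append((i, (d + i * c - r, d + i * c + r)))
    return intervals

# Every object o fetched from an interval is then refined by checking
# the real distance: keep o only if dist(q, o) <= r.
```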

4.2.3 Motivation

Because of the distance transformation, iDistance is lossy in nature. It is possible that points with the same iDistance values are actually not close to one another; some may be closer to q, while others are far from it. Specifically, there are two kinds of distance errors in iDistance: (1) points with a big difference in distance in the original data space obtain similar (or the same) iDistance keys in the transformed space (Figure 4.2(a)); and (2) points with small distances in the original high-dimensional data space obtain keys with a relatively big difference (Figure 4.2(b)). Most distance errors in the iDistance technique, however, are of type 1, because many points located around the reference point have almost the same distance to it. The distance error of type 2, on the other hand, happens only for points with a small distance to each other that are distributed to different clusters; this mostly affects cluster-edge points and is not a frequent case. Hence, the performance of iDistance is influenced mostly by distance errors of type 1, since all candidates located in the affected area have to be examined.

Fig. 4.2. Distance errors: (a) type 1, (b) type 2.

Figures 4.2(a) and (b) illustrate these two cases of distance errors, respectively. Our proposed multidimensional extension of iDistance attempts to address these deficiencies and, due to its optimized distance splitting and data transformation strategies, is very efficient for k-NN queries.

4.3 The MiD Technique

In this section we present a variation of the iDistance technique which transforms the data into a low-dimensional space instead of a one-dimensional one. This approach is a multidimensional extension of iDistance and is henceforth called the "Multidimensional iDistance" (MiD) technique. The main idea remains the same: distance information is indexed through an index structure. The difference is explained in the following subsections; instead of a single one-dimensional distance, in this extension we have multiple distances for every cluster, which build the multi-dimensional key of this cluster.

As in the original iDistance technique, we partition the dataset into clusters and for every cluster we choose a reference point. Before the distance calculation between a point from the cluster and its associated reference point,


Fig. 4.3. Example of data transformation: (a) database example, (b) iDistance on S1, (c) MiD mapping.

we split the dimensions into m subspaces (m corresponds to the desired dimensionality of the transformed space). In a second step we calculate the m partial distances from the partial reference point to every partial point in the partition (cluster). The result, i.e. the m distances, corresponds to the m-dimensional key of this data point. In the following this is referred to as the multidimensional iValue.

4.3.1 Data Transformation

In this subsection we address the main deficiency of the iDistance technique (see Section 4.2), namely the distance preservation deficiency. We present a new mapping which preserves distance information better through a multi-dimensional transformation.

Let S be a space defined by d dimensions and P a dataset on S. Let us assume that the data space is divided into a set of partitions (clusters) and that a reference point is identified for each partition. A multidimensional mapping defines a new m-dimensional space S′ for each partition (1 ≤ m ≤ d). The m dimensions are generated by splitting the original data space into m subspaces and computing the partial distance between the object and the reference point for every subspace. Using the multidimensional iDistance extension, the following mappings can be defined: (1) the one-dimensional mapping, in which all points in P are mapped to a one-dimensional space; (2) the m-dimensional mapping, where the d dimensions are assigned to m subspaces forming the reduced m-dimensional space; and (3) the d-dimensional mapping, i.e. all points are represented in the d-dimensional space.

In the following example we illustrate how, using a multidimensional iDistance variation, our approach is able to produce fewer distance errors. Figure 4.3(a) illustrates a database of five objects P = {a, b, ..., e} in a 3-dimensional data space S = {d1, d2, d3} forming a single cluster Ci. The points are transformed using the multi-dimensional iDistance extension, and the data space is vertically partitioned into two subspaces. Let us further assume that the first subspace is S1 = {d1, d2}, as shown in Figure 4.3(b), whereas the second subspace is S2 = {d3} (every point is projected on d3). For each point we calculate two distance values. The one-dimensional distances between the points in subspace S1 and their corresponding reference point are calculated using the iDistance mapping (Figure 4.3(b)). In addition, one-dimensional distances are computed for all points projected on S2. Finally, a 2-dimensional data space is defined by these partial distances. More particularly, for a point o ∈ P we have

f1 = √((o1 − P1)² + (o2 − P2)²)   and   f2 = √((o3 − P3)²)

as shown in Figure 4.3(c).

Through the multi-dimensional extension we are able to reduce the distance errors. In the worst case the distance errors are as frequent as in the original iDistance. For example, assigning the first and third attribute to the same subspace, e.g. S1 = {d1, d3}, and the remaining attribute to S2, we obtain a distance error. This indicates that a multi-dimensional extension of the iDistance technique strongly depends on the attribute assignment, which in turn influences its performance.
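A minimal sketch of this transformation, assuming the subspace assignment is given explicitly (the helper name mid_value is ours):

```python
import math

def mid_value(point, reference, subspaces):
    """Map a d-dimensional point to its m-dimensional MiD key.
    `subspaces` lists the dimension indices of each subspace,
    e.g. [[0, 1], [2]] for S1 = {d1, d2}, S2 = {d3}."""
    return tuple(
        math.dist([point[j] for j in dims], [reference[j] for j in dims])
        for dims in subspaces
    )

# Example in the spirit of Figure 4.3: split a 3-d space into
# S1 = {d1, d2} and S2 = {d3}; the reference point is illustrative.
P_i = (1.0, 2.0, 0.5)
print(mid_value((4.0, 2.0, 1.5), P_i, [[0, 1], [2]]))  # -> (3.0, 1.0) = (f1, f2)
```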

To summarize, the more subspaces the attributes are distributed over, the better the topology of the original data is preserved; at the same time, the dimensionality of the transformed data space increases. In a d-dimensional data space the data are best characterized by d-dimensional keys; using m-dimensional keys (m ≤ d) instead increases the chance of a distance error. The smaller the parameter m, the worse the preserved topology of the data space. On the other hand, the performance of a multidimensional index structure decreases with an increasing value of m (the dimensionality of the key). The remaining questions are how to compute the total number of subspaces m, and how to assign the attributes to the subspaces. For presentational purposes, we assume that the parameter m is given; later, in Section 4.5, we provide a cost model for its computation.

4.3.2 Dimension Assignment

For a given cluster Ci, the problem we consider here is how to distribute the dimensions over the m subspaces. Our approach is based on the notion of distinct iValues. Recall that, in the original iDistance technique, the more distinct iValues a dimension has, the lower is the chance of a distance error. Therefore we exploit how the number of distinct iValues influences the multidimensional iDistance. The distinct values of a proposed grouping can be estimated by taking a sample of the points in each cluster, computing their iValues, and counting the number of distinct iValues.

Our greedy algorithm works as follows. Given m subspaces (dimension sets) Si and a set of unassigned dimensions, the subspace with the smallest number of distinct iValues is picked, and the unassigned dimension that maximizes its number of distinct iValues is added to it. In an initialization step, the number of distinct iValues is calculated for each dimension separately, and the m dimensions with the highest numbers of distinct iValues are assigned to the m subspaces Si. In each subsequent step, we pick the subspace Smin with the minimum number of distinct iValues. Thereafter we calculate the distinct iValues for each remaining dimension di merged with the picked subspace. The dimension contributing the highest number of distinct iValues to the subspace is chosen. If there are no dimensions left, the algorithm terminates and outputs the subspaces. The algorithm finishes with m subspaces, each of them containing one or more dimensions. Finally, each data point is mapped to an m-dimensional point according to the partial distances and is indexed by an R-tree. Our algorithm is illustrated in Algorithm 6, where dvSi denotes the number of distinct iValues of Si.

Algorithm 6 Dimension Assignment Algorithm
1: Input: D = {d1, d2, ..., dd}
2: Input: S = {S1, S2, ..., Sm}
3: while (D is not empty) do
4:   Smin ← argmin_{Si ∈ S} {dvSi};
5:   dm ← argmax_{d ∈ D} {dv(Smin ∪ d)};
6:   Smin = dm ∪ Smin;
7:   D = D − dm;
8: end while
9: output S

Our greedy algorithm tries to choose subspaces that are most beneficial for the multidimensional iDistance variation, and thereby supports efficient k-NN search.
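A hedged Python sketch of this greedy assignment; distinct_ivalues estimates the number of distinct partial distances on a sample, and the rounding used to collapse nearly identical floating-point values is our own illustrative choice:

```python
import math

def distinct_ivalues(sample, reference, dims, precision=6):
    # Number of distinct partial distances (iValues) over the sample;
    # rounding collapses nearly identical floating-point values.
    return len({
        round(math.dist([p[j] for j in dims], [reference[j] for j in dims]),
              precision)
        for p in sample
    })

def assign_dimensions(sample, reference, d, m):
    """Greedy assignment of d dimensions to m subspaces (cf. Algorithm 6)."""
    # Initialization: the m individually most distinctive dimensions
    # seed the m subspaces; the rest remain unassigned.
    ranked = sorted(range(d),
                    key=lambda j: distinct_ivalues(sample, reference, [j]),
                    reverse=True)
    subspaces = [[j] for j in ranked[:m]]
    unassigned = ranked[m:]
    while unassigned:
        # Pick the subspace with the fewest distinct iValues ...
        s_min = min(subspaces,
                    key=lambda s: distinct_ivalues(sample, reference, s))
        # ... and add the unassigned dimension that raises them the most.
        best = max(unassigned,
                   key=lambda j: distinct_ivalues(sample, reference, s_min + [j]))
        s_min.append(best)
        unassigned.remove(best)
    return subspaces
```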

4.3.3 Index Construction

The data structure of MiD consists of separate multidimensional indexes for each cluster, connected to a single root node (shown in Figure 4.4). The root R contains the following information for each cluster Ci: (1) a pointer to the root node of the cluster index Ri (i.e. the address of the disk block containing Ri), (2) the subspace dimensionality mi, and (3) the reference point Pi. For each cluster Ci, a multidimensional index Ii stores the reduced-dimensional representation of the points in Ci. Any disk-based multidimensional index structure (e.g., R-tree, X-tree) can be used for this purpose. We used the R-tree in our experiments because of its wide availability in commercial systems.

Like other database indexes (e.g., B-tree, R-tree), the MiD index shown in Figure 4.4 is disk-based. However, it may not be perfectly height-balanced, i.e. all paths from the root R to a leaf may not be of exactly equal length. The reason is that the sizes and the dimensionalities may differ from one cluster to another, causing the cluster indexes to have different heights.


Fig. 4.4. The MiD structure: (a) data space division, (b) index structure.

Given the transformation determining the mValue of a point p, it is a simple task to build an index based on the MiD technique. The details are shown in Algorithm 7.

Algorithm 7 Insertion Algorithm
1: Input: pnew: new data point, MiD;
2: Output: MiD;
3: Ci ← findClusterID(pnew);
4: mValue ← transform(pnew);
5: InsertR-tree(mValue, MiD(Ci));
6: return MiD

In order to dynamically insert a new point pnew, the routine findClusterID(pnew) returns the ID of the cluster that pnew belongs to (line 3); then the index key of the new data point is obtained by the same transformation as explained above (line 4). Furthermore, the key of pnew is inserted into the corresponding R-tree (line 5). Finally, the index is returned (line 6). Update and delete operations can be handled analogously.

4.4 Query Processing

After the data space transformation and having indexed the keys, we now discuss how the multi-dimensional iDistance is realized and how queries are processed. The main idea is the same as in the original iDistance technique: the multidimensional iValue of the query point is calculated and, in a second step, the corresponding clusters are searched.

4.4.1 Range Search Algorithm

Let Pi be the reference point of a data partition Ci. A data point p in the partition can be referenced via Pi in terms of its distance to it: dist(Pi, p).


Using the triangle inequality, it is straightforward to see that:

dist(Pi, q) − dist(p, q) ≤ dist(Pi, p) ≤ dist(Pi, q) + dist(p, q).

For a range query range(q, r) with a search radius r, we are interested in finding all points p such that dist(p, q) ≤ r. For every such point p, by combining this inequality with the one above, we have dist(Pi, q) − r ≤ dist(Pi, p) ≤ dist(Pi, q) + r. Even though our approach is more general, we restrict the presentation to the Lp metrics and, more particularly, to the Euclidean distance (L2).

At first, the multidimensional iValue (i.e. dist(Pi, q)) of the query point is calculated and used as a start point for the search. The corresponding range query is issued against each cluster that may contain results. All data points which fall into the search region have to be examined, and their distances to the query point have to be checked against the search radius.

Algorithm 8 Range Search Algorithm
1: Input: Query Point q, Radius r
2: Output: the result of the range query range(q, r)
3: S = {};
4: for each partition Ci that intersects with the range query do
5:   R = execute_range_query(Ci, [dist(Pi, q) − r, dist(Pi, q) + r]);
6:   for each point p ∈ R do
7:     d = dist(p, q) // calculate real distance
8:     if d > r then
9:       R = R − p
10:    end if
11:  end for
12:  S = S + R;
13: end for

Our algorithm is depicted in Algorithm 8. We access the objects of the partition Ci by an incremental search according to the distance dist(Pi, p) = √(dist1² + ... + distm²) = √(d1² + ... + dd²). Thereafter, a range query with dist(Pi, q) − r ≤ dist(Pi, p) ≤ dist(Pi, q) + r is executed for every partition, i.e. index Ci. The results are refined by computing the real distances and merged with the results of all the affected indexes.

4.4.2 k-NN Search Algorithm

As in the original iDistance, the search begins with a small radius which is incrementally enlarged. All data points which fall into the search region have to be examined, and their distances to the query point have to be checked against the search radius. The search terminates as soon as k nearest neighbors are found and the search radius is greater than the distance between the query point and the k-th nearest neighbor. Our algorithm is depicted in Algorithm 9.

Algorithm 9 k-NN Search Algorithm
1: Input: Query Point q, Radius δr
2: Output: the k nearest neighbors of q
3: r = 0; S = {};
4: while (true) do
5:   pfurthest = furthest(S, q);
6:   if dist(pfurthest, q) < r and |S| == k then
7:     return S;
8:   end if
9:   r = r + δr;
10:  S = rangeQuery(q, r);
11:  while (|S| > k) do
12:    S = S − furthest(S, q);
13:  end while
14: end while

The search radius is initialized (line 3) and constantly enlarged in line 9. In lines 6-7 the algorithm checks whether the termination condition is fulfilled. This condition means that all k nearest neighbors have been found and that increasing the search radius cannot produce any data points closer to the query point than the already found candidates. In that case the algorithm terminates and the set S is returned as the result. If the termination condition is not fulfilled, all points which fall inside the search radius have to be scanned (lines 10-13), which is realized as a range query (line 10). If there are more than k data points in the result set S, the points farthest from the query point are deleted.

4.4.3 Incremental k-NN Search Algorithm

In contrast to the original iDistance approach, in this subsection we present an incremental nearest neighbor algorithm. The index corresponding to a partition Ci accesses the data points incrementally according to the distance dist(Pi, p). Since the points are accessed by increasing distance to the cluster center, a point p is the next nearest neighbor if the inequality

max dist(Pi) > dist(Pi, q) + dist(p, q). (4.1)

holds for every cluster Ci, where max_dist(Pi) is the distance of the last point retrieved by the index corresponding to cluster Ci.

Our algorithm is depicted in Algorithm 10. First, the distance dist(Pi, q) of the query point to each cluster center Pi is calculated and used as a start point for the incremental nearest neighbor search on each index. The algorithm accesses all affected indexes in parallel, producing a sorted list of points for each cluster Ci. We can report the k nearest points when there is a set of at least k points for which inequality 4.1 holds. The algorithm sets as thresholds the distance values of the last point retrieved from each index and stops when the current k points all satisfy inequality 4.1 with respect to these thresholds. All retrieved data points have to be examined, and their distances to the query point have to be calculated.

Algorithm 10 Incremental k-NN Search Algorithm
1: Input: Query Point q, number of NN k
2: Output: the k nearest neighbors of q
3: max1, ..., maxm = 0; S = {};
4: while (|S| < k) do
5:   if ! heap.isEmpty() then
6:     p = heap.peek();
7:     if ∀ 1 ≤ i ≤ m: maxi > dist(Pi, q) + dist(p, q) then
8:       S = S + {p};
9:       heap.dequeue();
10:    end if
11:  end if
12:  i = argmin_{1 ≤ j ≤ m}(|maxj − dist(Pj, q)|);
13:  p = next_nearest_neighbor(Ci, q);
14:  maxi = dist(Pi, p);
15:  heap.enqueue(p);
16: end while

The indexes are able to deliver the data points in an appropriate order, and a local incremental NN query runs concurrently on every index (line 13). The retrieved points, which are delivered as the results of these local queries, are kept sorted according to their distance dist(p, q) from the query point using a heap (line 15), and the thresholds are updated (line 14). In each iteration an index is chosen based on the minimum value of |maxj − dist(Pj, q)| (line 12). The search terminates as soon as k nearest neighbors are found and confirmed, i.e. as soon as inequality 4.1 holds for the k current points.

4.5 Theoretical Analysis

We decided to use the fractal dimension as the basis of our analysis for estimating the cost of k-NN queries. In the following, we first estimate the NN distance; thereafter we estimate the number of candidates in a cluster that have to be examined and the number of clusters that intersect with the query sphere. After that we estimate the number of nodes whose MBRs intersect the NN sphere. Finally, our cost model determines the number of subspaces (i.e. the dimensionality of the indexes) for which the sum of all I/Os is minimized.

4.5.1 Preliminaries

Let P be a set of points with finite cardinality N, embedded in a unit hypercube of dimension d, with fractal (correlation) dimension D2. The dataset is partitioned into a set of n partitions (clusters). We assume that the data does not follow a uniform and independent distribution and that the query points follow the same distribution as the data points. The average (Euclidean) distance r of a data point to its k-th nearest neighbor is then given by:

r = ((Γ(1 + d/2))^(1/d) / √π) ∗ (k / (N − 1))^(1/D2)   (4.2)
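Formula (4.2) can be evaluated directly; a small sketch (the numbers in the example call are illustrative):

```python
import math

def knn_radius(k, N, d, D2):
    """Expected distance to the k-th nearest neighbor (formula 4.2):
    embedding dimensionality d, fractal (correlation) dimension D2,
    cardinality N, data in the unit hypercube."""
    return (math.gamma(1 + d / 2) ** (1 / d) / math.sqrt(math.pi)
            * (k / (N - 1)) ** (1 / D2))

# e.g. a 10-NN query on a 68,040-point dataset with d = 16 and an
# assumed fractal dimension of 7.5
print(knn_radius(10, 68040, 16, 7.5))
```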

Our algorithm examines a region of each cluster Ci that intersects with the query sphere. Under the assumption that the radius of the query is much smaller than the radius of the cluster Ci, we assume in our cost analysis that for each cluster intersecting the query sphere a region of radius r is examined. Therefore, the number of candidates that have to be examined by the algorithm depends on the fractal dimension D2,i of the partition (cluster) with embedded dimensionality m, as shown in the following formula:

nbi(r, D2,i) = (Ni − 1) ∗ Vol(r)^(D2,i/m) = (Ni − 1) ∗ (√π ∗ r)^D2,i / Γ(1 + m/2)^(D2,i/m)   (4.3)

where m is the embedding dimensionality of the cluster, D2,i its fractal dimension, and Ni its cardinality.

The total number of candidates is:

nb(r) = Σ_{1 ≤ i ≤ n} nbi(r, D2,i)   (4.4)

The goal of our cost model (and of the dimension assignment algorithm) is to establish a good performance for all indexes. Therefore, we assume that Ni = N/n = Nc and D2,i = D′2 for each cluster Ci. Now we need to determine the number of clusters that intersect with the query sphere. We follow the same methodology as in [37]. The number of clusters nc(q, r) that intersect with a query sphere can be estimated as the sum of the probabilities that a cluster intersects with the query, i.e.:

nc(q, r) = Σ_{1 ≤ i ≤ n} P(Ci intersects with the query) ≈ Σ_{1 ≤ i ≤ n} F(r(Ci) + r)   (4.5)


where P(·) denotes the probability and F(·) the distance density function. Under the assumption of the homogeneity of the viewpoints [37], the number of clusters nc(q, r) is independent of the query point q, i.e. nc(r) = nc(q, r) for any point q. Additionally, since the clustering is performed in an initial step, we can estimate the distance function by counting the frequencies of the distances between the cluster centers using a histogram.

4.5.2 Estimating the Number of Node Accesses

The I/O cost can be one page access per candidate in the worst case. As observed in [28] and as confirmed by our experiments, this assumption is overly pessimistic. We therefore assume the I/O cost of the candidates to be the number of candidates divided by 2:

Acand(r) = (nc(r) / 2) ∗ (Nc − 1) ∗ (√π ∗ r)^D′2 / Γ(1 + m/2)^(D′2/m)   (4.6)

The I/O cost of processing a local k nearest neighbor query (i.e. within a cluster) is equal to the cost of the equivalent range query of radius r. The expected number of page accesses for a range query can be determined by multiplying the access probability with the number of data pages N/Ceff, where Ceff is the effective page capacity. Based on the previous assumptions, the total cost of processing the local queries is:

Aindex(r) = nc(r) ∗ (Nc / Ceff) ∗ ( Σ_{0 ≤ j ≤ m} (m choose j) ∗ ((1 − 1/Ceff) ∗ (Ceff / (N − 1))^(1/D′2))^j ∗ (√π^(m−j) / Γ((m−j)/2 + 1)) ∗ r^(m−j) )^(D′2/m)   (4.7)

4.5.3 Estimating the Parameter m

Our algorithm performs incremental nearest neighbor queries in a high-dimensional data space by exploiting multiple low-dimensional indexes. The cost for processing a local nearest neighbor query within a lower-dimensional subspace increases with an increasing partial fractal dimension. In order to calculate the appropriate m (i.e. the dimensionality of the indexes), we examine different values of the parameter m and count the number of I/Os required to find the k-NN. Our cost model thus defines the appropriate number of subspaces as the one for which the sum of all I/Os is minimized.
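The selection of m then reduces to a one-dimensional sweep; a sketch, assuming a callable total_cost(m) that adds up the candidate cost (4.6) and the index cost (4.7) for a given m:

```python
def choose_m(d, total_cost):
    """Return the subspace dimensionality m (1 <= m <= d) minimizing the
    estimated total I/O. `total_cost(m)` is assumed to sum up the
    candidate cost (4.6) and the index cost (4.7)."""
    return min(range(1, d + 1), key=total_cost)

# Illustrative stand-in for the cost model; in practice total_cost
# would evaluate formulas (4.6) and (4.7).
costs = {1: 150, 2: 90, 3: 60, 4: 75, 5: 95, 6: 110, 7: 125, 8: 140, 9: 155}
print(choose_m(9, costs.get))  # -> 3, matching the minimum in Figure 4.5
```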

Figure 4.5 shows the total cost over the indexes in a database consisting of 68,040 data points in a 9-dimensional data space. We choose the parameter k by setting the expected average selectivity of a query to 0.01%. There is a clear minimum at m = 3.


Fig. 4.5. Total cost of a k-NN query as a function of the parameter m.

4.6 Experimental Evaluation

In this section we experimentally confirm that the multidimensional iDistance technique is both scalable and superior to the original iDistance technique. To verify the effectiveness of MiD, three types of datasets have been used as experimental data: (1) The real dataset comprises the Color Histogram dataset¹, containing 32-dimensional image features extracted from 68,040 images. All data values of each dimension are normalized to the range [0,1]. (2) The uniform dataset is a random point set consisting of points distributed uniformly in the range [0,1] in each dimension. In our evaluation, we generated uniform datasets with varied cardinalities and dimensionalities. (3) The clustered dataset is a synthetic dataset whose default number of clusters is 30; the cluster centers are randomly generated, and within each cluster the data follows a normal distribution with a default standard deviation of 0.05. The data size and dimensionalities are the same as those of the uniform dataset.

In terms of implementation, we set the page size of each R-tree to 4K, each dimension is represented by a real number, and we keep the entire point in each low-dimensional index. All datasets are indexed as suggested by our greedy algorithm. Note that the implementation of the original iDistance method is not I/O optimized as far as sequential scans are concerned. In our multidimensional iDistance, the parameter m (i.e. the number of subspaces) is calculated as suggested by our cost model. All experiments are measured in terms of the average disk page accesses, as well as the CPU time, over 100 queries.

4.6.1 Effect of the Number of Dimensions

In the first experiment, we study the effect of the dimensionality on the performance of 10-NN search, using the clustered dataset with 100,000 points and dimensionality ranging from 9 to 32.

1 UCI KDD Archive, http://www.kdd.ics.uci.edu


Fig. 4.6. Effect of the dimensionality: page accesses of MiD and iDistance.

Figure 4.6 compares the query performance in terms of I/O cost as the dimensionality increases. MiD outperforms the original iDistance method, since our index method can more effectively reduce the search region. As shown in this figure, iDistance performs poorly in terms of I/O cost, and the performance gap between iDistance and MiD becomes larger as the dimensionality keeps increasing. The I/O performance of iDistance degrades due to the many additional nodes at the leaf level that have to be accessed and the large number of false positives. Although the original iDistance method involves clustering and partitioning, which helps to prune faster and access fewer pages, the gap between the I/O cost of iDistance and that of MiD grows with the dimensionality, since it is hard for iDistance to effectively prune the search region using only a single one-dimensional distance.

4.6.2 Effect of the Dataset Size

In this experiment, we use the three types of datasets mentioned above to measure the performance of 10-NN queries with a varying number of data points. Exactly 100 different 10-NN queries are performed over the synthetic datasets, which consist of 9-dimensional uniform and clustered data ranging from 50,000 to 100,000 points.

Figure 4.7 shows the performance of query processing in terms of CPU cost. It is evident that MiD outperforms the original iDistance method in terms of CPU cost on both datasets. It is interesting to note that the performance gap between the two methods becomes larger with growing data size, since query processing is a CPU-intensive operation. For the uniform and clustered datasets, the CPU cost of both methods increases almost linearly as the dataset size increases, but the increase for iDistance is faster.

In Figure 4.8, the query performance with respect to the I/O cost is evaluated. The experimental results reveal that MiD yields a consistent performance gain and is more efficient than iDistance.


Fig. 4.7. Average CPU cost: (a) uniform dataset, (b) clustered dataset.

Fig. 4.8. Average I/O cost: (a) uniform dataset, (b) clustered dataset.

MiD is not very sensitive to the data volume, since the increase in data size does not increase the height of the R-trees substantially, and it can more effectively filter away points that are not in the answer set. It is important to mention that the gap between the I/O cost of MiD and that of iDistance widens as the data size increases. The I/O cost of MiD grows linearly with the dataset size and is also better than that of iDistance for the uniform dataset, since iDistance scans almost all of the leaf-level nodes of the whole index.

4.6.3 Performance on k-NN Search

We now study the effect of the parameter k for retrieving the k nearest neighbors with our multidimensional iDistance. More particularly, we vary the parameter k from 10 to 50. For this experiment we use the uniform dataset, which has a full-space dimensionality of 9 and a cardinality of 100,000 points.

In Figure 4.9(a), the x-axis represents the values of the query parameter k, whereas the y-axis measures the page accesses required to compute the k-NNs. We compare our multidimensional iDistance algorithm against iDistance; our algorithm is consistently superior. The same observations also hold for the number of candidates, shown in Figure 4.9(b).

In order to study the efficiency of MiD on the real dataset (Figure 4.10), which consists of 68,040 32-dimensional data points, we varied the query parameter k between 10 and 50.


Fig. 4.9. Average number of I/Os and candidates as a function of k (uniform dataset).

Fig. 4.10. Average number of I/Os and CPU cost as a function of k (real dataset).

Our index is consistently superior to iDistance as far as page accesses are concerned. However, the CPU cost of the multidimensional iDistance increases almost linearly and approaches the cost of the original iDistance. This is simply due to the additional cost of finding the NNs when the query sphere is enlarged: more indexes are affected, and more time is spent merging the results.

4.6.4 Effect of the Parameter m

In this experiment, we study the effect of the number of subspaces (parameter m, the index dimensionality) on the efficiency of 10-NN search, using the 9-dimensional uniform dataset with 100,000 points. Figures 4.11(a) and 4.11(b) show that with increasing m, the I/O and CPU cost of the k-NN search first decrease gradually, since the average search region shrinks as the number of subspaces increases. Once m exceeds a threshold (e.g., 3), the significant overlap of different MBRs leads to high I/O and CPU cost in the k-NN search. Therefore, m should be treated as a tuning factor.


Fig. 4.11. Effect of m on k-NN search: (a) m vs. I/O cost, (b) m vs. CPU cost.

4.7 Conclusions

In this chapter, we proposed a multidimensional extension of the original iDistance method, which we term Multidimensional iDistance (MiD). Three main steps are taken in building MiD: first, all data points are grouped into a set of partitions (clusters). Second, a reference point is identified for each partition. Finally, every data object is mapped to an m-dimensional distance vector where m > 1 generally holds. The m dimensions are generated by splitting the original data space into m subspaces and computing the partial distance between the object and the reference point for every subspace. The resulting m-dimensional points can be indexed by an arbitrary point access method like an R-tree.

Compared with the original iDistance technique, both theoretical analysis and experimental studies have shown that MiD can prune the search space more effectively.


Part III

Skyline Query Optimization


5

Constrained Subspace Skyline Computation

The set of skyline points presents a scale-free choice of data points worthy of further consideration in many contexts. Informally, the skyline of a multi-dimensional data set contains the "best" tuples according to any preference function that is monotonic in each dimension. Subspace skyline queries [105], [131] have been recognized as a useful tool towards better matching the user's requirements. Instead of calculating the skyline over all dimensions, subspace skyline queries give the user the choice to select the relevant dimensions of his/her interest; the subspace skyline is then returned, containing the most interesting points. In this chapter we go beyond subspace skyline computation and present an interesting generalization termed the constrained subspace skyline.

This chapter is organized in the following way. In Section 5.1, we motivate the need for efficient computation of constrained subspace skyline queries. In Section 5.2 we formulate the problem of constrained subspace skyline queries and present a threshold-based skyline algorithm (called STA), which exploits multiple indexes and is optimized for uniform data. In Section 5.3 we effectively solve the problem of using multiple points for discarding vertically distributed data points during skyline computation for arbitrary distributions. Section 5.4 presents a workload-adaptive strategy for determining the number of indexes and the assignment of dimensions to the indexes. The experimental results are presented in Section 5.5. Finally, Section 5.6 concludes this chapter with a discussion of the presented algorithms.

5.1 Introduction

The skyline operator and its computation have attracted much attention recently. This is mainly due to the importance of skyline results in applications involving multi-criteria decision making. Given a set of d-dimensional objects, the operator returns all objects pi that are not dominated by another object pj. Consider for example a database containing information about cars.


Each tuple of the database is represented as a point in a data space consisting of numerous dimensions. The x-dimension could represent the price of a car, whereas the y-dimension captures its mileage (shown in Figure 5.1). According to the dominance definition, a car dominates another car if it is cheaper and has run fewer miles.

Fig. 5.1. Constrained skyline in a 2-d subspace (price vs. mileage reading).

In practice, a user may only be interested in records within the price range from 3,000 to 7,000 Euro and with a mileage reading between 20,000 and 100,000 miles. The example in Figure 5.1 shows that the traditional skyline (dashed line) fails to return the interesting points. A constrained skyline query returns the most interesting points in the data space defined by the constraints; each constraint is a range along a dimension, and the conjunction of all constraints forms a hyper-rectangle in the d-dimensional data space.

On the other hand, skyline analysis applications usually provide numerous candidate attributes, and various users may issue queries regarding different subsets of the attributes, depending on their interests. While the dimensionality of the corresponding data space might be rather high, skyline queries generally refer to a low-dimensional subspace. In our running example, a car database could contain many other attributes of the cars, such as horsepower, age and fuel consumption. A customer who is sensitive to the price and the mileage reading (the 2-dimensional subspace in Figure 5.1) would like to pose a skyline query on those attributes rather than on the whole data space. By combining constrained and subspace skyline queries, we provide a powerful operator that returns the really interesting objects to the user. Constrained subspace skyline queries form the generalization of all meaningful skyline queries over a given dataset.

Maintaining a high-dimensional index to support constrained skyline queries of arbitrary dimensionality is generally not suitable. It has been shown that the performance of such high-dimensional indexes deteriorates with an increasing number of dimensions [21]. Especially in applications which involve constrained subspace skyline retrieval, a multi-dimensional index over the whole data space shows poor performance. As shown in [102], constrained skyline queries whose constraint regions cover more than a certain portion of the data space are usually more expensive than regular skylines. Even a range very close to 100% is much more expensive than a conventional skyline.

The performance of indexes (such as R-trees), as used in [102], does not scale well with the number of dimensions. Only low-dimensional indexes, e.g. R-trees, seem to perform well in practice and for that reason have found their place in commercial database management systems (DBMS) [82]. Although the dimensionality of a given skyline query will typically be small, the range of dimensions over which queries can be composed can be quite large, often exceeding the performance limit of the indexes. For an index to be of practical use, it would need to cover most of the dimensions used in queries. In order to cover the entire range of dimensions over which queries can be composed, we propose in this chapter to build several indexes on small subsets of dimensions (such that their union covers all dimensions). We are therefore able to reuse existing general-purpose low-dimensional indexes that are also employed for other queries.

Although both constrained skyline queries and subspace skyline queries have been examined independently in the previous literature, the implications of constrained subspace skyline queries have not been addressed. Existing approaches for subspace skyline computation may be classified into three categories. The first category contains general-purpose algorithms which can support constrained subspace skylines with a slight modification. Such approaches include the BBS algorithm proposed by Papadias et al. [102]. BBS is a branch-and-bound skyline computation algorithm based on nearest neighbor search which progressively outputs the skyline points of a dataset indexed by an R-tree [62], [13]. A modification of this standard index-based constrained skyline algorithm that simply ignores the irrelevant attributes can be used for constrained subspace skyline computation.

The second category contains approaches which precompute all possible subspace skylines and are therefore able to answer a subspace skyline query on the fly. Pei et al. [105] and Yuan et al. [131] independently proposed the concept of the Skyline Cube (SKYCUBE), which consists of the skylines of all possible subspaces. The compressed SKYCUBE was proposed in [128] in order to deal with frequent updates. Even for applications that are restricted to constrained subspace skyline computation, the whole dataset must be indexed; it is not possible to pre-calculate the points of the full-space skyline and their duplicates, since the result depends on the given constraints.

Finally, the third category contains the SUBSKY algorithm, which was recently proposed in [119]. This approach is related to our work as it addresses subspace skyline retrieval. However, this method transforms the multi-dimensional data into one-dimensional values and thereby permits indexing the dataset with a B-tree. SUBSKY is unable to answer constrained subspace skyline queries effectively, as all points have to be transformed in a preprocessing step.


In order to support constrained skyline queries in an arbitrary subspace, we vertically partition the data space among several low-dimensional subspaces and index each of these subspaces using an R-tree. A query is evaluated by merging the partial results from all affected indexes while simultaneously pruning away the dominated data spaces. The benefit of this approach is twofold. On the one hand, the low-dimensional queries are processed by low-dimensional and well-performing indexes. On the other hand, this approach can easily be adapted to the underlying query workload. We distinguish between two different cases. Firstly, the queried subspace is a subset of one indexed subspace. In this case, a modification of a standard index-based constrained skyline algorithm like [102] that simply ignores the irrelevant attributes can be used for the skyline computation. Secondly (the more complicated case), the queried subspace is not entirely covered by any of the indexed subspaces. As already noted in [56], the skyline of a set of dimensions cannot be computed from the skylines of subsets of its dimensions. In this case, the constrained skyline of a specified subspace is computed by merging the results of constrained nearest neighbor queries [50] that run simultaneously on some of the low-dimensional indexes in a synchronized fashion. In fact, we present an elegant solution for merging the query results over the involved indexes.

The constrained subspace skyline computation over vertically partitioned data poses several technical challenges. An algorithm to merge the partial results into the skyline result set is required. An important performance parameter is the pruning strategy; it is desirable to use multiple points to define pruning regions. Finally, the strategy that partitions the data space vertically among several low-dimensional subspaces influences the overall performance. All of the above issues are addressed in the remainder of this chapter.

5.2 Constrained Subspace Skyline Computation for Uniform Data

In this section, we develop a progressive algorithm for constrained subspace skyline computation for the case where the data space is vertically distributed over multiple indexes. In order to employ low-dimensional indexing for constrained subspace skyline queries, we partition the data space vertically among several low-dimensional subspaces and index each of these subspaces using an R-tree. A constrained subspace skyline query is then partitioned into several sub-queries, each of which is processed by utilizing the corresponding index. In order to answer constrained skyline queries on an arbitrary dimension set, it is necessary that every involved attribute belongs to at least one dimension set. We restrict our analysis by assuming that every attribute belongs to exactly one dimension set. After the problem definition, we propose an effective pruning strategy, present our algorithm, and propose an advanced scheduler to determine in which order to query the indexes.



5.2.1 Preliminaries

Given a space $S$ defined by a set of $d$ dimensions $\{d_1, d_2, \ldots, d_d\}$ and a dataset $P$ on $S$, a point $p \in P$ can be represented as $p = (p_1, p_2, \ldots, p_d)$, where every $p_i$ is a value on dimension $d_i$. Each non-empty subset $S'$ of $S$ ($S' \subseteq S$) is referred to as a subspace (or dimension set). A point $p \in P$ is said to dominate another point $q \in P$ on subspace $S'$, denoted as $p \prec_{S'} q$, if (1) on every dimension $d_i \in S'$, $p_i \le q_i$; and (2) on at least one dimension $d_j \in S'$, $p_j < q_j$. The skyline of a space $S' \subseteq S$ is the set of points $P' \subseteq P$ which are not dominated by any other point on space $S'$, that is, $P' = \{p \in P \mid \nexists q \in P : q \prec_{S'} p\}$. The points in $P'$ are called skyline points on space $S'$. Based on the definition of skyline points on a subspace, we define the constrained subspace skyline.

Definition 5.1. (Constrained Subspace Skyline)
Let $C = \{c_1, c_2, \ldots, c_k\}$ be a set of range constraints on subspace $S' = \{d'_1, d'_2, \ldots, d'_k\}$, where $k \le d$. Each $c_i$ is expressed as $[c_{i,min}, c_{i,max}]$, where $c_{i,min} \le c_{i,max}$, representing the minimum and maximum value of $d'_i$. The constrained skyline of a space $S' \subseteq S$ is the set of points $P'_c \subseteq P$ such that $P'_c = \{p \in P_c \mid \nexists q \in P_c : q \prec_{S'} p\}$, where $P_c \subseteq P$ and $P_c = \{p \in P \mid \forall d_i \in S' : c_{i,min} \le p_i \le c_{i,max}\}$.

For a point $p \in P_c$ in the dimension set $S'$, the dominance region $DR_{S',C}(p)$ is $\{q \in P_c \mid p \prec_{S'} q\}$. Similarly, the anti-dominance region $ADR_{S',C}(p)$ of a point $p$ refers to the set of points dominating $p$. In Figure 5.2, the dominance and anti-dominance regions of a point are illustrated. A straightforward observation is that a point $q$ falls within the dominance region of point $p$ if and only if $p$ falls within the anti-dominance region of point $q$.

Fig. 5.2. Dominance region of p [figure: the constrained region $[C_{1,min}, C_{1,max}] \times [C_{2,min}, C_{2,max}]$ in the $(d_1, d_2)$ plane, together with the dominance region (DR) and anti-dominance region (ADR) of a point $p$]
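To make Definition 5.1 concrete, the following is a minimal, naive Python sketch of the dominance test and of constrained subspace skyline computation. All names are ours, and the quadratic scan is for illustration only; the algorithms in this chapter avoid it by means of indexes.

```python
def dominates(p, q, dims) -> bool:
    """True iff p dominates q on the subspace given by the dimension indexes `dims`."""
    strictly_better = False
    for i in dims:
        if p[i] > q[i]:
            return False             # p is worse on some queried dimension
        if p[i] < q[i]:
            strictly_better = True   # p is strictly better on at least one
    return strictly_better

def in_constrained_region(p, dims, constraints) -> bool:
    """constraints[j] = (c_min, c_max) for the j-th queried dimension."""
    return all(c_min <= p[d] <= c_max
               for d, (c_min, c_max) in zip(dims, constraints))

def constrained_subspace_skyline(points, dims, constraints):
    """Naive O(n^2) reference implementation of Definition 5.1."""
    cand = [p for p in points if in_constrained_region(p, dims, constraints)]
    return [p for p in cand
            if not any(dominates(q, p, dims) for q in cand if q is not p)]
```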

Let us assume that the dimension set S′ is vertically distributed over n dimension sets S′i (∪S′i = S′, 1 ≤ i ≤ n). Each dimension set S′i has a corresponding Ci (∪Ci = C, 1 ≤ i ≤ n). In the following, we present a lemma that allows us to identify skyline points using the subspaces only.



We first assume that the distinct value condition holds [105], [131]. This condition ensures that for any two data points $p$ and $q$ ($p, q \in P$) with $p_i \neq q_i$ ($\forall i \in S$), and for two sub dimension sets $U$ and $V$ ($U, V \subseteq S$) with $U \subset V$, we have $SKY_U(P) \subseteq SKY_V(P)$. In other words, this condition says that the containment relationship between superset and subset skylines holds.

Lemma 5.2. A point $p \in P_c$ is a skyline point in $S'$ if and only if there exists no point $q \in P_c$ that belongs to the anti-dominance area of $p$ in all dimension sets $S'_i$ ($1 \le i \le n$).

Proof. Let $q \in ADR_{S'_i,C}(p)$ for each $1 \le i \le n$. Then $q$ dominates the projection of $p$ in every $S'_i$, i.e., $q_j \le p_j$ for all $d_j \in S'_i$, with strict inequality in at least one dimension of each $S'_i$. Since $\bigcup S'_i = S'$, it follows that $q_j \le p_j$ for all $d_j \in S'$ and $q_j < p_j$ for at least one $d_j \in S'$. Thus, $p$ is dominated by $q$ in $S'$ and is not a skyline point in $S'$.

This lemma gives us a powerful rule for determining whether a point p is dominated by a point q or not. Note that p is not dominated by q if and only if p's projection is not dominated by q's projection in at least one S′i. If the dominance test returns true in all subspaces, we immediately obtain that q dominates p. In the case where the distinct value condition does not hold, all objects that are equal to p in one of the n dimension sets S′i have to be examined.
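Lemma 5.2 suggests the following hedged sketch of the domination test over the vertical partitions (it reuses the dominates helper from the sketch above; the partition layout is our assumption):

```python
def dominated_in_all_partitions(p, q, partition) -> bool:
    """True iff q dominates p's projection in every dimension set S'_i.

    `partition` is a list of dimension-index lists, one per S'_i.  By
    Lemma 5.2 (under the distinct value condition), p can be discarded
    when this test holds."""
    return all(dominates(q, p, dims) for dims in partition)
```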

5.2.2 Pruning with the Nearest Neighbor

In this subsection we propose a pruning strategy that uses only one point for pruning. Based on Lemma 5.2, we can discard all points that fall in the dominance area of a point in every dimension set S′i without losing any skyline point. Any point may be used for pruning, but it is desirable that the chosen point prunes equally well in all dimension sets S′i and maximizes the total number of discarded points. Moreover, the point should be computed quickly in order to apply pruning very early.

In our one-point pruning strategy, we suggest using the nearest neighbor with respect to the Manhattan distance to the lower left corner of the constrained region of the dimension set S′. The reasons for choosing this point are as follows. Firstly, because it is a member of the skyline, there is no dominating point. Secondly, among all the skyline points it is the one with the maximum perimeter of the dominance region. Because perimeter and volume are highly correlated, this also gives a large volume on average, and hence it is also expected to prune a large percentage of the data points. Thirdly, our skyline algorithm guarantees that the nearest neighbor is computed first among all skyline points.
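For illustration, this distance can be sketched as a one-liner, under the assumption that the point already lies inside the constrained region:

```python
def manhattan_to_corner(p, dims, constraints) -> float:
    """L1 distance of p's projection to the lower-left corner of the
    constrained region (assumes c_min <= p[d] for every queried dimension)."""
    return sum(p[d] - c_min for d, (c_min, _c_max) in zip(dims, constraints))
```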

In the following, we present the results of an empirical analysis of the size of the dominance region, assuming $N$ data points independently and uniformly distributed in the $d$-dimensional unit cube. In Figure 5.3, the expected value of the volume is plotted as a function of the dimensionality. As illustrated, we obtain large regions for low- and medium-dimensional data, while the pruning effect suffers on high-dimensional data. Moreover, we have observed only a marginal increase of the volume compared to the maximum area of a point.

Fig. 5.3. Pruned area as a function of dimensionality [figure: expected pruned volume (0 to 1) plotted against dimensionality (0 to 20)]

In the following subsection we present our algorithm for constrained subspace skyline computation.

5.2.3 STA: A Threshold-based Skyline Algorithm

Let us assume that the data space consists of the dimension set S, while each of the n indexes Ri corresponds to a dimension set Si (1 ≤ i ≤ n). A constrained skyline query refers to a dimension set S′ under constraint C. The skyline query is partitioned into n sub-queries, each of which refers to a dimension set S′i (S′i ⊆ Si) under constraint Ci (Ci ⊆ C and ∪Ci = C, 1 ≤ i ≤ n). The points are retrieved through incremental constrained nearest neighbor search on each index i based on the distance to the lower left corner of the constrained region according to the dimension set S′i, from now on referred to as nearest neighbor search. The retrieved points are merged into a skyline result set. Notice that the dimensionality of the indexed dimension set Si and that of the queried dimension subset S′i are not necessarily equal; thus, during the nearest neighbor search, irrelevant dimensions are ignored. Moreover, the points are delivered in order of their Manhattan distance to the lower left corner of the constrained region according to the dimension set S′i. We first assume that the distinct value condition holds [105], [131]. Later, we relax this assumption.

Our algorithm consists of two steps: a filter step and a refinement step. In the filter step, the points in each dimension set S′i are accessed simultaneously based on the sorting according to dimension set S′i. Points outside the constrained region are immediately discarded. In the refinement step, a potential skyline point, also called a candidate, is tested for domination. The refinement step needs to determine whether this candidate is a skyline point or not. This requires a domination test based on the query dimension set S′. Because each index Ri returns the points in a strictly monotonic order, a point p retrieved from Ri can only be dominated by points that are retrieved from Ri before p. Thus, we need to test a point for domination only against the skyline points that have already been found, and we can immediately return a point that is not dominated as a skyline point. The dominance test can be performed in a way similar to traditional window queries [102] using a main-memory R-tree whose dimensionality is equal to the query dimensionality.

While the points in each dimension set S′i are accessed simultaneously, our algorithm aims to find the first nearest neighbor quickly. Thus, we keep the point with the smallest distance as a potential nearest neighbor, and we use the Manhattan distance of the last reported point of each S′i as a threshold for finding the nearest neighbor. This leads to an algorithm which is similar to Fagin's threshold algorithm (TA) [45] for list merging. As soon as the constrained nearest neighbor according to S′ is found, we can discard all points that belong to the dominance area of the nearest neighbor in each dimension set S′i (due to Lemma 5.2). In fact, the incremental constrained nearest neighbor search is executed with respect to a pruning area: during the incremental constrained nearest neighbor search, internal and leaf nodes are discarded if they are entirely covered by the pruning area.
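The TA-style stopping condition can be sketched as follows. This is a simplified, hypothetical merge loop with a plain round-robin scheduler, where streams, local_dist and full_dist are assumed helpers; the actual algorithm additionally applies the constrained region and the pruning areas.

```python
def first_global_nn(streams, local_dist, full_dist, n):
    """Find the first global NN over n streams sorted by local L1 distance."""
    last = [0.0] * n                       # last local distance seen per index
    best, best_dist = None, float("inf")
    i = 0                                  # simple round-robin scheduling
    while True:
        p = next(streams[i])
        last[i] = local_dist(p, i)
        d = full_dist(p)
        if d < best_dist:
            best, best_dist = p, d         # new NN candidate
        # Any unseen point has full distance >= sum(last), so the candidate
        # is the confirmed global NN once it beats this threshold.
        if best_dist <= sum(last):
            return best
        i = (i + 1) % n
```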

An index scheduler determines the order in which the indexes are visited. A naive way of visiting the indexes is a round-robin strategy. This, however, ignores the fact that we strive for a fast computation of the nearest neighbor and that the indexes differ in their performance. Therefore, we are interested in more advanced strategies resulting in a fast increase of the threshold. The underlying assumption of our advanced scheduling strategy is that an index that has been productive in the past will continue to be productive in the near future. Therefore, we advance the processing with the index that maximizes the benefit for increasing the threshold.

Algorithm 11 outlines the algorithm. Our algorithm abstracts from the implementation details of the organization of the multidimensional points. The naive approach is to keep the entire points in each of the low-dimensional indexes. However, this causes a high storage overhead. A more advanced approach is to store only the associated attributes of the subspaces in the index and to keep a copy of the entire point in a hash table where the point can be retrieved using its associated identifier. Each time a point is received from one of the indexes, the entire record information is read from the hash table.
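A minimal sketch of this storage layout (all names are ours):

```python
full_records = {}                          # point id -> full d-dimensional point

def subspace_entry(point_id, point, dims):
    """Entry inserted into the low-dimensional R-tree of one subspace:
    only the subspace attributes plus the identifier are stored."""
    return (point_id, tuple(point[d] for d in dims))

def fetch_full(point_id):
    """Look up the entire record when an index delivers a point id."""
    return full_records[point_id]
```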

5.2.4 Discussion

STA has some desirable properties which make this threshold-based algorithm suitable for constrained subspace skyline computation, especially for uniform data, as we have shown in this section. However, real datasets do not follow a uniform distribution, and for this reason STA may fail to discard a large number of data points. The strategy of using multiple indexes to compute the constrained subspace skyline can be extended in order to benefit from real datasets, by following the methodology described next.



Algorithm 11 STA: A Threshold-based Skyline Algorithm
1: Input: P′_c = {∅} denotes the constrained subspace skyline set
2: dist ← maxDistance; // dist represents the distance to the current NN
3: ind ← indexScheduler(); // pick an index
4: while (1 ≤ ind ≤ n) do
5:   o ← R_ind.nextNN(S′_ind, C_ind);
6:   d_ind ← distance(o, S′_ind, C_ind);
7:   if ((o ∉ P′_c) && (not P′_c.dominated(o))) then
8:     P′_c.insertAndDiscard(o); // discard dominated points
9:     if (distance(o, S′, C) < dist) then
10:      nnCandidate ← o; // update the current NN candidate
11:      dist ← distance(nnCandidate, S′, C);
12:    end if
13:    if (dist < dist_1 + ... + dist_n) then
14:      for (1 ≤ i ≤ n) do
15:        R_i.pruningArea ← DR_{S′,C}(nnCandidate); // prune with the NN
16:      end for
17:      ind ← indexScheduler();
18:    end if
19:  end if
20: end while
21: return P′_c

Simply, instead of delivering a globally ranked list of nearest neighbors, we may exploit the fact that local nearest neighbors are sufficient to decide whether an area can be discarded or not. Therefore, our framework is also suitable for the non-uniform case and can be extended to arbitrary distributions.

5.3 Improved Pruning for Arbitrary Distributions

The main disadvantage of the previous pruning strategy is that only one point is used for pruning. In the previous section we showed that the nearest neighbor is expected to prune a high percentage of a uniformly distributed dataset, but for real datasets the pruning power of the nearest neighbor may diminish. In this section we propose a pruning strategy that allows us to enlarge the dominance area using multiple points for pruning.

5.3.1 Motivating Example

We begin with the following motivating example to show the limited applicability of the nearest neighbor pruning strategy. Let us consider a non-uniform dataset consisting of clusters, as illustrated in Figure 5.4, where two clusters exist. As illustrated, only roughly half of the points belong to the dominance area of the first nearest neighbor p1. (Note that we use subscripts for differentiating points and attributes interchangeably, as long as the context allows recognizing the difference.) As a consequence, all remaining points need to be retrieved and tested for domination.

Fig. 5.4. Simultaneous pruning [figure: points p1 and p2 in the (x, y) plane together with the dominance regions DR_x(p1), DR_x(p2), DR_y(p1) and DR_y(p2)]

Our goal is therefore to prune the data space using multiple points. Let us generalize the algorithm presented in Section 5.2 to deal with this problem. Considering Figure 5.4, after retrieving the first nearest neighbor p1, the pruning area in both projections DR_x(p1) and DR_y(p1) is discarded, and we continue expanding the local indexes until we find the second nearest neighbor p2 (assuming that it is not dominated by the first nearest neighbor). When trying to prune in both projections with the second nearest neighbor, we may falsely discard skyline points located in the lower-left rectangle (shown with dashed lines). While it seems easy to determine the region within which skyline points may be falsely discarded, for a higher query dimensionality and a larger number of dimension sets this region enlarges and is difficult to define. As a consequence, we are not able to prune simultaneously in both subspaces using the same point.

To overcome this problem, we make the following observation: when points lying in the dominance region of a point are not discarded in at least one subspace, then we are able, under certain conditions, to discard points in all remaining subspaces while guaranteeing no false dismissals. Therefore, we use the points that are retrieved as local constrained nearest neighbors from an index for pruning in all other indexes. Local nearest neighbor points that lie outside the constrained region in the query space are discarded and not used for pruning. In contrast to the nearest neighbor strategy of Section 5.2, this new pruning strategy also provides the advantage of discarding points even before the first nearest neighbor is found.




For the sake of simplicity, from now on we assume that the query space is the data space and that the query refers to the whole dataset. It is straightforward to adapt the strategy to constrained subspace skyline queries by ignoring the irrelevant dimensions and discarding points that do not belong to the constrained region.

Let us now consider the example in Figure 5.5, where a 4-dimensional data space is divided into two 2-dimensional subspaces (partitions) S1 = {d1, d2} and S2 = {d3, d4}. When the point p1 is retrieved from subspace S1, the dominance area of the point p1 in subspace S2 is used for pruning.

Fig. 5.5. Pruning strategy [figure: points p1, p2 and p3 in subspace S1 = (d1, d2), and the projection of p1 in subspace S2 = (d3, d4)]

Unfortunately, by following this strategy some skyline points are falsely discarded. In Figure 5.6, we give an example. Let the point q in the projection on S2 coincide with the point q1. The point p, based on Lemma 5.2, is not a skyline point in S, since it is dominated by q in all dimension sets of S. On the other hand, if the point q in the projection on S2 coincides with the point q2, then point p may be discarded falsely, since it is a potential skyline point.

Fig. 5.6. Example of the pruning strategy [figure: points p, p1, p2 and p3 in subspace (d1, d2), and the two possible positions q1 and q2 of q's projection in subspace (d3, d4)]

In the next subsection we investigate this case in more detail by providing a lemma which helps us to ensure that we do not discard points falsely.



5.3.2 Improved Pruning using Multiple Points

We now present a pruning strategy that ensures that the entire skyline set is returned and no point is falsely discarded. Let us continue with our running example with two 2-dimensional subspaces. Each time a new point is retrieved from subspace S1, this point is a candidate for pruning in subspace S2.

In the following, let DR(S_i) be the discarded region of a dimension set S_i. A point belongs to the discarded region if points that lie in its dominance area are discarded. Let us assume that the discarded region of S1 consists of the dominance areas of the projections of the points q1, q2, ..., qm. To discard points from the dominance area of p in S2, the point p and every point qi must be dominated by the projection of the same point in S2 and S1, respectively. This condition must hold for each point qi which belongs to the discarded area of S1.

Since the complexity of this test is high, and in order to keep the overhead low, we restrict the search for a dominating point to the following two conditions (assuming n = 2):

1. the projection of point qi belongs to the anti-dominance area of p in S2, or

2. the projection of point p belongs to the anti-dominance area of qi in S1.

If one of the two conditions holds for each point qi, the dominance area of the projection of point p in S2 can be discarded. In Lemma 5.3, we generalize these conditions for arbitrary n.
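For n = 2, this test can be sketched as follows (a hedged sketch reusing the dominates helper from Section 5.2.1; the list-based bookkeeping is a simplification of the main-memory R-trees used later in this section):

```python
def may_prune_in_s2(p, discarded_s1, dims_s1, dims_s2) -> bool:
    """Check whether a new point p (retrieved from S1) may be used for
    pruning in S2: for every point q already pruning in S1, condition (1)
    or condition (2) must hold."""
    for q in discarded_s1:
        cond1 = dominates(q, p, dims_s2)   # q's projection dominates p in S2
        cond2 = dominates(p, q, dims_s1)   # p's projection dominates q in S1
        if not (cond1 or cond2):
            return False                   # pruning with p could be unsafe
    return True
```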

We illustrate these conditions by an example depicted in Figure 5.7. Again, the data space S is divided into two subspaces, S1 and S2. The left column contains projections of points in subspace S1, whereas the right column contains projected points in S2. Points are retrieved using local nearest neighbor search. The first point p1 is found in projection S1. Since no point is used for pruning yet, the dominance area of p1 in projection S2 can be discarded. In the second step (assuming round-robin retrieval), the first point p2 in projection S2 is retrieved. The anti-dominance area of p1 in subspace S2 contains point p2, which means that p2 dominates p1 in S2. It is not possible for a point that belongs to both dominated regions, for example point q, to be a skyline point, since it is dominated by p1 and p2 in S.

After this, by retrieving the next point p3 in projection S1, the anti-dominance area of p2 does not contain p3 in S1, but the anti-dominance area of p3 in projection S2 contains point p2. This guarantees that a point which is pruned by both discarded areas is dominated, based on Lemma 5.2. The following lemma generalizes the previous strategy to n dimension sets.

Lemma 5.3. Let $p(S_i)$ be the projection of point $p$ on subspace $S_i$. No point is falsely discarded if for each tuple $(u_1(S_1), u_2(S_2), \ldots, u_n(S_n)) \in DR(S_1) \times DR(S_2) \times \ldots \times DR(S_n)$ there exists a $k$ ($1 \le k \le n$) such that $\forall i \neq k: u_k(S_i) \in ADR_{S_i}(u_i(S_i))$.



Fig. 5.7. Improved pruning heuristic [figure: three retrieval steps; the left column shows the projections of p1, p3 and q in subspace (d1, d2), the right column the projections of p1, p2 and q in subspace (d3, d4)]

Proof. Let $p$ be a point that is discarded by the pruning strategy. Then $p$ belongs to the dominance area of some point $u_i(S_i) \in DR(S_i)$ for each dimension set $S_i$, i.e., $u_i(S_i) \in ADR_{S_i}(p(S_i))$ for each $i$. Since for each tuple of $DR(S_1) \times DR(S_2) \times \ldots \times DR(S_n)$ there exists a $k$ ($1 \le k \le n$) such that $\forall i \neq k: u_k(S_i) \in ADR_{S_i}(u_i(S_i))$, transitivity of dominance yields $u_k(S_i) \in ADR_{S_i}(p(S_i))$ for all $i \neq k$, while $u_k(S_k) \in ADR_{S_k}(p(S_k))$ holds directly. Hence, based on Lemma 5.2, the point $p$ is dominated in $S$ by the point $u_k$ and is not a skyline point.

Generalizing the previous strategy, each time a point p is to be added to the discarded area of S_k, the condition must hold for all tuples (u_1(S_1), ..., u_{k-1}(S_{k-1}), p, u_{k+1}(S_{k+1}), ..., u_n(S_n)). For an efficient implementation of the pruning mechanism, n main-memory R-trees are used, each with the dimensionality of the corresponding S_i, keeping the points that participate in the discarded area of S_i. While the points visited according to the partial sorting imply the examined region, these main-memory indexes implement the discarded region.

In order to confirm the accuracy of the multiple-points pruning strategy, we study the performance of our STA algorithm using real-world datasets (see Section 5.5). We examine constrained subspace skyline queries of lower dimensionality. Each workload contains 100 queries that request the skylines of 100 random subspaces with fixed dimensionality d_sub = 3. We set the constrained region to cover 60% of each (queried) axis. We report the number of page accesses required for calculating the skyline.

Fig. 5.8. Pruning Strategies

Figure 5.8 reports the effect of the different pruning strategies. As expected, the improved pruning strategy using multiple points is superior to the strategy which uses only one point for pruning, except for the NBA dataset. However, the one-point pruning strategy works very well even for real datasets.

5.3.3 Scheduler Adaptation

Finally, we adapt our scheduler to the improved pruning strategy. In the case of the nearest neighbor pruning strategy, it is important to speed up the merging phase and to find the first nearest neighbor quickly. In the improved pruning strategy it is more important to prune equally large regions over all indexes. When an index delivers the next point, this point may be used for pruning in all indexes except the index that delivered it. In the case where one index has a high number of pruning points while another index has only few, the first index performs very well, in contrast to the other, which will perform very poorly. Thus, we give control to the index that has the fewest pruning points, so that the other indexes have the opportunity to enlarge their discarded regions.
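A minimal sketch of this scheduling rule (names are ours):

```python
def pick_index(pruning_points):
    """pruning_points[i] holds the points currently used for pruning in
    index i; advance the index with the fewest of them, so that the other
    indexes can enlarge their discarded regions."""
    return min(range(len(pruning_points)),
               key=lambda i: len(pruning_points[i]))
```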

5.4 Adaptive Subspace Indexing using Low Dimensional R-trees

In this section we present an algorithm to distribute the data space over multiple indexes. This method is very effective, especially in applications which involve constrained subspace skyline retrieval, where a multidimensional index over the whole data space shows poor performance. As shown in [102], constrained skyline queries whose constraint regions cover more than a specific portion of the data space are usually more expensive than regular skylines. It has been shown in [102] that even a range very close to 100% is much more expensive than a conventional skyline. This effect is even more crucial when subspace skyline queries are applied; in this case, additional performance degradation is caused by the random grouping effect (which is explained later in this section). Even for applications that are restricted to constrained subspace skyline computation, the whole dataset must be indexed. It is not possible to pre-calculate the points of the full space skyline and their duplicates, since the result depends on the given constraints.

In the following, we provide a formula for determining the appropriate number of indexes (Subsection 5.4.1). A greedy algorithm which assigns the attributes to the indexes is developed to find the most beneficial dimension sets for skyline calculation (Subsection 5.4.2). Finally, we extend our approach so that it adapts to the query workload (Subsection 5.4.3).

5.4.1 Number of Indexes

In this subsection we present a formula for determining the number of indexes. Toward this goal we investigate the random grouping effect of a multidimensional index. Our goal is to make clear why a high-dimensional index is not appropriate for subspace skyline computation and how low-dimensional indexes could avoid performance degradation.

The performance of low-dimensional constrained skyline queries decreases when the dimensionality of the indexed space is high in contrast to the low dimensionality of the query space. This is mostly caused by the random grouping effect. Since not all dimensions are used for splitting during the creation of the index, a query that requires a projection makes the index perform like a random low-dimensional index, i.e., an index that groups the points into leaf nodes in a mostly random manner. As an example, consider a 10-d data space and assume that we are interested in retrieving the skyline of any 2-d subspace. If only two dimensions are used for splitting, then the probability that exactly the queried dimensions have been used for splitting is very small. Thus, the query performance is similar to the performance of a 2-d index where the data points were grouped together randomly.

We now model this effect in a more formal way. We assume uniformity and independence in the data distribution. If every leaf node is split at least once in each dimension, we need a total number of at least $2^d$ leaf nodes. Given the effective data page capacity $C_{eff}$ of the R-tree, which depends on the index dimensionality, the number of leaf nodes can be estimated as:

$$L = \left\lceil \frac{N}{C_{eff}} \right\rceil \quad (5.1)$$

We define an index as well-performing if $L \ge 2^d$, which means that every leaf node is split by each dimension once.

The maximum dimensionality for which a low-dimensional index fulfils the above property is:

$$d_{max} = \mathop{\mathrm{argmin}}_{1 \le i \le d} \left( \left| 1 - \frac{L}{2^i} \right| \right) \quad (5.2)$$

The resulting formula for the number of indexes is:

$$n = \left\lceil \frac{d}{d_{max}} \right\rceil \quad (5.3)$$

To explain how we determine the number of indexes, we apply the above formulas to the 32-dimensional Color histogram dataset (see Section 5.5), which consists of 68,040 records corresponding to photo images. In Figure 5.9 we plot, for different values of $d$, the expected random grouping effect, expressed through the formula $|1 - L/2^d|$. The figure shows only the values up to the minimum. Assuming a page size of 4K, the maximum dimensionality $d_{max}$ for which a low-dimensional index is well-performing for this dataset is 11.

Fig. 5.9. Random Grouping Effect

Our formula suggests that we should create 3 indexes for this dataset. In this way we index high-dimensional datasets more effectively, avoiding the performance degradation due to the random grouping effect.
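Formulas (5.1)-(5.3) translate directly into code. The following sketch assumes ceff as an estimator of the effective page capacity for a given index dimensionality (it depends on the page and entry sizes):

```python
import math

def number_of_indexes(N: int, d: int, ceff) -> int:
    """Compute the number of indexes n via formulas (5.1)-(5.3)."""
    def leaves(i):                          # formula (5.1)
        return math.ceil(N / ceff(i))
    # formula (5.2): index dimensionality minimizing the random grouping effect
    d_max = min(range(1, d + 1), key=lambda i: abs(1 - leaves(i) / 2 ** i))
    return math.ceil(d / d_max)             # formula (5.3)
```

For the Color histogram dataset above (N = 68,040, d = 32, 4K pages), the text reports d_max = 11, so the function would return ⌈32/11⌉ = 3.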

Having shown how we determine the number of indexes, we provide in the next subsection an algorithm to distribute the attributes to the indexes.



5.4.2 Attribute Assignment Algorithm

Our goal is to build well-performing indexes for answering constrained subspace skyline queries; thus we propose a strategy to distribute the attributes over the n indexes, where n corresponds to the number of indexes determined in the previous subsection. We are interested in estimating the comparative cost between two subspaces. We explore how the number of distinct values influences the performance of STA.

During the threshold-based skyline calculation, each low-dimensional index returns some local skyline points that might be included in the final result set. Independently of the pruning strategy, it is impossible to prune the local skyline points, and these points have to be examined in the query-dimensional space. There are two different reasons why a point might be discarded later. First, there may be many points whose projections coincide with a low-dimensional point, so that it is dominated by some duplicate point in the query-dimensional space. Second, in the case of constrained subspace skyline queries, a skyline point might be discarded because it lies outside the constrained region in the query-dimensional space.

Since the test of discarding points caused by the constrained region has a low computational cost, we focus on the first case. In this case, a domination test for all those points is required in a refinement step, even though it is quite possible that only a few of them are skyline points in the query-dimensional space. This effect is influenced by how many distinct values each dimension has.

Thus, as a quality measure of a subspace $S_i$ we propose the number of distinct values $dv_{S_i}$. Methods to estimate the number of distinct values can be found in [63]. We propose a greedy algorithm that aims to restrict the random grouping effect while maximizing the number of distinct values.

Let n be the estimated number of indexes, based on the random grouping effect. Our greedy algorithm works as follows: given n dimension sets $S_i$ and a set of unassigned dimensions, the dimension set with the smallest number of distinct values is picked, and the unassigned dimension that maximizes the number of distinct values is added to it.

At an initialization step, the number of distinct values for each dimension is calculated separately, and the n dimensions with the highest numbers of distinct values are assigned to the n dimension sets $S_i$. At the next step, we pick the dimension set $S_h$ with the minimum number of distinct values. Thereafter we calculate the cost for each remaining dimension $d_i$ merged with the picked dimension set. The dimension contributing to the dimension set with the highest number of distinct values is chosen. Following the above methodology, we add a new dimension to a dimension set. The greedy algorithm outputs a dimension set, and an index is created on it, as soon as the dimensionality of this dimension set is equal to the maximum dimensionality $d_{max}$. If there are no dimensions left, the algorithm terminates and outputs the remaining dimension sets, and an index is created on each of them even though their dimensionality is not equal to the maximum dimensionality. Our algorithm is illustrated in Algorithm 12, where the function dim returns the dimensionality of a subspace.

Algorithm 12 Attribute Assignment Algorithm
1: Input: D = {d_1, d_2, ..., d_d} // unassigned dimensions
2: Input: S = {S_1, S_2, ..., S_n} // dimension sets
3: while (S is not empty) do
4:   S_min ← argmin_{S_i ∈ S} {dv_{S_i}};
5:   if ((dim(S_min) != d_max) && (D is not empty)) then
6:     d_m ← argmax_{d ∈ D} {dv_{S_min ∪ {d}}};
7:     S_min ← S_min ∪ {d_m}; D ← D \ {d_m};
8:   else
9:     output S_min; S ← S \ {S_min};
10:  end if
11: end while

The algorithm finishes with n dimension sets, each of them containing one or more dimensions. Note that each index receives at most $d_{max}$ attributes. At the same time, our algorithm tries to restrict the random grouping effect while maximizing the number of distinct values. The random grouping effect decreases with increasing index dimensionality, and as a consequence our assignment algorithm strives to find a balance between the number of indexes and the dimensionality of each index.

Our greedy algorithm tries to choose the dimension sets which are most beneficial for a threshold-based skyline algorithm, by minimizing the number of points that have to be examined. In the next subsection we modify the greedy algorithm to take advantage of the query workload.

5.4.3 Workload-adaptive Extension

The subspaces are unlikely to have the same probability of being requested in a query. Rather, we might be able to associate some probability with each subspace, representing the frequency with which it is queried. The fact that user preferences are correlated strengthens our approach. In such a case it is even more important to use multiple indexes, which are built on the most preferred subspaces.

A simple but very powerful extension over our basic model is to weight the cost estimation of each dimension set by its probability. This is a natural way to express the fact that a dimension set influences the query performance only if it is requested.
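As a sketch, with distinct_values and query_prob as assumed estimators (e.g., derived from [63] and from query logs, respectively), the weighted quality measure could look as follows:

```python
def weighted_score(dimension_set, distinct_values, query_prob):
    """Quality of a candidate dimension set, weighted by the frequency
    with which the workload actually queries it."""
    return query_prob(dimension_set) * distinct_values(dimension_set)
```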

This method of creating indexes that adapt to the underlying query workload is important for indexing the most common queries and, as a result, for guaranteeing an overall low query response time. This extension also allows us to examine the performance of our algorithms under a workload which is closer to real applications, instead of picking random subspaces.



5.5 Performance Evaluation

In this section we experimentally confirm that our STA algorithm is both scalable and superior to other techniques. For the experimental evaluation of our approach, three datasets from real-world applications are used. Specifically, the NBA dataset contains 17,000 13-dimensional points, where each point corresponds to the statistics of a player in 13 categories. Color moments contains 9-dimensional features of 68,040 photo images extracted from the Corel Draw database. Color histogram consists of 32-dimensional features, representing the histogram of an image. Additionally, we generated 10-dimensional uniform datasets with varied cardinality. In the first set of experiments we study the efficiency of the proposed algorithm on constrained subspace skyline computation. In the second set of experiments, we examine the influence of the dimensionality of the dataset on our algorithm and study its scalability. In a final set of experiments, we investigate the special case in which we request the constrained skyline of a subspace based on a more realistic query workload. In particular, the queries follow the "80-20" rule.

In terms of implementation, we set the page size of each R-tree to 4K, each dimension is represented by a real number, and we keep the entire point in each low-dimensional index. All datasets are indexed as suggested by our greedy algorithm.

5.5.1 Examination of Constrained Subspace Skylines

In this section we compare our technique with the BBS algorithm [102], which is the best known method for R-tree-based skyline retrieval. We index each dataset through a full-space R-tree, and a modification of the BBS algorithm that simply ignores the irrelevant attributes is used for constrained subspace skyline computation. We used all real and synthetic datasets, namely the NBA, Color moments, Color histogram and Uniform datasets, for the following experiments.

Effect of Constrained Region

We now study the effect of the constrained region. More specifically, we vary the constrained region from 50% to 100% of each dimension. We examine subspaces with a dimensionality of 3. For the purpose of this experiment we use the uniform dataset, which has a full space dimensionality of 10 and a cardinality of 50,000 points.

Fig. 5.10. Uniform dataset [figure: page accesses (0 to 1600) of STA and BBS vs. constrained region (50% to 100% of each axis)]

In Figure 5.10, the x-axis represents the constrained region, whereas the y-axis measures the page accesses required to calculate the subspace skyline within the constrained region. We compare our STA algorithm against BBS. As we can observe from Figure 5.10, our algorithm is consistently superior. Note that a range close to 100% is much more expensive than a conventional subspace skyline. However, for constrained subspace skyline queries our algorithm is generally more effective, requiring fewer page accesses than BBS for arbitrarily large constrained regions. A crucial observation is that the performance of our algorithm is not affected significantly by the size of the constrained region.

From now on, the size of the constrained region is set for each queried dimension to 60% for the uniform datasets and 80% for the real datasets.

Effect of Subspace Dimensionality

In this section we compare our algorithm with BBS on constrained subspace skyline computation with respect to the subspace dimensionality. For the first experiment of this section we used the 10-dimensional Uniform dataset, which has a cardinality of 50,000, whereas for the second experiment we used the 9-dimensional Color moments dataset, which has a cardinality of 68,040. We compare our STA algorithm with BBS, again varying the query subspace dimensionality from 2 to 4.

Fig. 5.11. Uniform and Color datasets [figure: page accesses of STA and BBS vs. subspace dimensionality (2 to 4); (a) Uniform dataset, (b) Color moments dataset]



The left part of Figure 5.11 captures the performance of STA and BBS by reporting the measured page accesses for different subspaces with varied dimensionality for the Uniform dataset. Our algorithm outperforms BBS for this dataset.

The right part of Figure 5.11 reports the number of page accesses for different subspaces with varied dimensionality for the Color moments dataset, using as query workload random subspaces of fixed dimensionality, namely 2, 3 and 4. These results demonstrate that the STA algorithm leads to substantially fewer page accesses than BBS.

5.5.2 Scalability with the Dataset Cardinality and Full-space Dimensionality

In this subsection we examine the scalability of our STA algorithm with respect to the dataset cardinality and the full-space dimensionality.

Effect of Dataset Cardinality

In order to study the effect of cardinality we use uniform datasets, which have a dimensionality of 10, and vary the cardinality between 10,000 and 100,000 points. In addition, we request the skyline of 3-dimensional subspaces. Figure 5.12 shows the number of node accesses with respect to the cardinality for the Uniform dataset.

Fig. 5.12. Uniform dataset [figure: page accesses (0 to 3000) of STA and BBS vs. dataset cardinality (10,000 to 100,000)]

STA clearly outperforms BBS, and the difference increases as the cardinality increases. The proposed method scales better with cardinality than BBS.

Effect of Full-space Dimensionality

To examine the influence of the full-space dimensionality we use the Uniform datasets. Each dataset consists of 50,000 points with a varied full space dimensionality of 10, 20 and 30, respectively. We set the subspace dimensionality constant, namely d_sub = 3.

Page 146: Advanced Indexing and Query Processing for Multidimensional Databases

132 5 Constrained Subspace Skyline Computation

Fig. 5.13. Uniform and Real datasets [figure: page accesses (0 to 4500) of STA and BBS; (a) Uniform datasets vs. full-space dimensionality (10, 20, 30), (b) Real datasets]

The chart in Figure 5.13 compares the performance of STA and BBS by reporting the measured page accesses needed for calculating the skyline while varying the full space dimensionality of the Uniform dataset. It is clear that BBS performs poorly, whereas STA needs fewer page accesses for processing. The degradation of BBS is mainly due to the poor performance of R-trees in high dimensions and the random grouping effect mentioned earlier. Our technique avoids these shortcomings by using indexes of lower dimensionality. Note that these factors also influence STA, but their effect is limited compared to the inherent deficiencies of the BBS algorithm. On the right side of Figure 5.13 we report the performance of the STA algorithm relative to BBS for all real datasets. The chart shows that our algorithm is consistently superior; in both cases our algorithm constantly outperforms BBS in this experiment.

5.5.3 Adaptation to the Query Workload

In this section we examine the performance of our algorithm for a query workload which is close to real applications. More specifically, we used the "80-20" rule to generate the queries. This means that 20% of the attributes contribute to 80% of the queries.

Figure 5.14 shows two plots: the left chart illustrates the performance of the algorithms in terms of page accesses, while the right plot captures the CPU overhead. In both cases our algorithm constantly outperforms BBS in this experiment.

Figure 5.15 shows the page accesses while the cardinality of the uniform dataset varies. More specifically, for non-random queries the difference increases more quickly with increasing cardinality than for random queries.

Fig. 5.14. Uniform dataset [figure: (a) I/O cost in page accesses and (b) CPU time in msec, for STA and BBS with subspace dimensionality 2 to 4]

Fig. 5.15. Scalability [figure: page accesses (0 to 3000) of STA and BBS vs. dataset cardinality (10,000 to 90,000)]

5.6 Conclusion

In this chapter, we proposed the constrained subspace skyline query as a meaningful generalization of subspace skylines. The proposed threshold-based skyline algorithm, STA, is able to retrieve skyline points in arbitrary dimension sets more efficiently by exploiting multiple low-dimensional indexes instead of relying on a single high-dimensional one. Therefore, we are able to reuse existing general-purpose low-dimensional indexes, which are also employed for other queries. Through an extensive performance evaluation, we showed the superiority of our proposed technique (STA) over related work and demonstrated that our method is robust and efficient when the dimensionality of the dataset grows. We proposed different pruning strategies to identify dominated regions. A strategy for determining the number of indexes and the assignment of dimensions to the indexes was presented; this strategy adapts to the underlying query workload.

Extensive performance evaluation shows the superiority of our proposed technique compared to the original BBS algorithm.


6

Efficient Computation of Reverse Skyline Queries

In this chapter, we study the problem of Reverse Skyline Queries. At first, we consider, for a multidimensional dataset P, the problem of dynamic skyline queries according to a query point q. This kind of dynamic skyline corresponds to the skyline of a transformed data space where point q becomes the origin and all points of P are represented by their distance vectors to q. The reverse skyline query returns the objects whose dynamic skyline contains the query object q. It is the complementary problem to that of finding the dynamic skyline of a query object.

The reverse skyline computation over multidimensional datasets poses several technical challenges. We approach the problem of processing RSQs by analytically exploiting the geometric properties of the problem and solution spaces. Based on the concept of the global skyline of a point, we are able to retrieve a small subset of the database as candidate reverse skyline points. We prove that all reverse skyline points with respect to q belong to the global skyline of q.

In order to compute the reverse skyline of an arbitrary query point, we first propose a branch-and-bound algorithm (called BBRS), which is an improved customization of the original BBS algorithm. Furthermore, we identify a superset of the reverse skyline that is used to bound the search space while computing the reverse skyline. To further reduce the computational cost of determining whether a point belongs to the reverse skyline, we propose an enhanced algorithm (called RSSA) that is based on accurate pre-computed approximations of the skylines. These approximations are used to identify whether a point belongs to the reverse skyline or not. Through extensive experiments with both real-world and synthetic datasets, we show that our algorithms can efficiently support reverse skyline queries. Our enhanced approach improves reverse skyline processing by up to an order of magnitude compared to the algorithm without pre-computed approximations.

The rest of the chapter is organized as follows. In Section 6.1 we motivate the need for reverse skyline queries. Section 6.2 reviews previous work related to skyline query processing and reverse NN search. Section 6.3 presents the problem definition, and Section 6.4 proposes an efficient branch-and-bound reverse skyline algorithm. In Section 6.5 the RSSA algorithm, which is based on pre-computed skyline approximations, is proposed. Section 6.6 deals with the problem of how to compute accurate approximations. Section 6.7 contains an extensive experimental evaluation that demonstrates the efficiency of the proposed algorithms. Finally, Section 6.8 concludes the chapter.

6.1 Introduction

Given a set P of d-dimensional points, the skyline operator returns all points in P that are not dominated by another point. A point p_i dominates another point p_j if the coordinate of p_i in each dimension is not greater than that of p_j, and strictly smaller in at least one dimension. The set of skyline points presents a scale-free choice of data points worthy of further consideration in many application contexts. Informally, the skyline of a multidimensional dataset contains the "best" tuples according to any preference function that is monotone in each dimension.

Skyline queries [26] are a specific and relevant example of preference queries [33], [84] and have been recognized as a useful and practical way to make database systems more flexible in supporting user requirements [32]. Skyline queries can be either absolute or relative, where 'absolute' means that minimization is based on the static attribute values of the data points in P, while 'relative' means that the minimization of the coordinate-wise distances between the data points and a user-given query point is taken into account. Consider for example a used car database with a table CarDB in which the car instances are stored as tuples with attributes Make, Model, Year, Price and Mileage. Obviously, the ideal case is to give the user the opportunity to specify a car of his choice and to request the skyline according to this car. The skyline query becomes a dynamic [101] or relative [43] skyline query, where the domination is defined with respect to the user's query point. The skyline is then computed with respect to a new transformed data space where the query point becomes the origin and all points are represented by their coordinate-wise distances to the query point. The dynamic skyline contains all those (transformed) points that are not dominated by any other point with respect to the distances to the query point.
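The transformation is easy to state in code. The following naive O(n^2) sketch computes the dynamic skyline and, on top of it, the reverse skyline (all names are ours; the chapter develops index-based algorithms precisely to avoid this exhaustive computation):

```python
def dynamic_skyline(points, q):
    """Static skyline of the space transformed by |p_i - q_i| per coordinate."""
    def dom(a, b):  # a dominates b in the transformed space
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))
    trans = [tuple(abs(pi - qi) for pi, qi in zip(p, q)) for p in points]
    return [points[i] for i, t in enumerate(trans)
            if not any(dom(s, t) for j, s in enumerate(trans) if j != i)]

def reverse_skyline(points, q):
    """All p whose dynamic skyline (over the other points plus q) contains q."""
    q = list(q)
    return [p for p in points
            if q in dynamic_skyline([x for x in points if x != p] + [q], p)]
```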

While there are many papers on skyline query processing from the user perspective (selecting the products they like), in this chapter we focus on the company's perspective. To explain, assume that the preferences of users about cars are stored as tuples in a relation with the same attributes as in CarDB. Figure 6.1 shows the preferences of users as points in a two-dimensional space, considering only the attributes Price and Mileage. A car dealer, who wants to determine the effectiveness of a particular car advertisement in the market, specifies a hypothetical car as a query point, say q. The dealer is interested in which customers find this car interesting, i.e., have this hypothetical car as part of their two-dimensional dynamic skyline. In Figure 6.1(a), a user interested in cars of type p2 would have been interested in the hypothetical car q, as q is in the dynamic skyline of p2. The same holds for preferences p4 and p6; both are in the reverse skyline of q, as the dynamic skylines of p4 and p6 contain q. Based on the result of the reverse skyline, the car dealer might offer car q to customers with preferences p2, p4 or p6. If the result set is too small, the dealer might change the price of the hypothetical car to increase the number of reverse skyline points.

Fig. 6.1. Example of Reverse Skyline [figure: customer preferences p1-p8 and query point q in the price-mileage plane; (a) Dynamic Skyline, (b) Reverse Skyline with the reverse skyline points p2, p4 and p6]

The query retrieving the set of most preferred cars in our motivating example belongs to a broader, novel type of Reverse Skyline Queries (RSQ). The goal of a reverse skyline query is to identify the influence of a query object on a multidimensional dataset with respect to a vector of distances. This is in contrast to the known problem of reverse nearest neighbors [86], where a single distance function is applied to the multidimensional data. Such a function is, however, too restrictive for many applications that deal with multidimensional data, as in our example above. In general, there is no meaningful and generally applicable distance function that maps a multidimensional point into a single value. Instead, we examine the 'influence' of a data point with respect to dynamically dominated points. Intuitively, given a point q, the points which have q as a member of their dynamic skyline reflect the notion of 'influence'. We call these points the reverse skyline of q.

Besides preference-based marketing, RSQ is highly relevant for many other applications. As another example, let us consider business location planning for a new store. Assuming several possible choices for the new store location, the strategy is to select the location that maximizes the number of customers. A reverse skyline query posed on a customer database would return those customers who are potentially interested in the new store.



6.2 Related Work

Although the proposed algorithms can be used with various indexes, we assume that the dataset is indexed by an R-tree due to the popularity of this structure. Section 6.2.1 briefly reviews skyline query processing using the R-tree and algorithms for dynamic (or relative) skyline computation. Section 6.2.2 describes previous work on (monochromatic) reverse nearest neighbor search. Note that the problem of reverse skylines has not been addressed so far.

6.2.1 Skyline Query Processing

Skyline query processing has received considerable attention in multidimensional databases and has been studied extensively in recent years [105], [131], [119]. Since the introduction of the skyline operator [26], several efficient algorithms have been proposed for the 'general' skyline query [101]. The best-known algorithm that can answer Dynamic Skyline Queries (DSQ) is the Branch-and-Bound Skyline (BBS) algorithm [101], a progressive optimal algorithm for the conventional skyline query. In the setting of BBS, a dynamic skyline query specifies a new n-dimensional space based on the original d-dimensional data space. First, each point p in the database is mapped to the point p' = (f_1(p), ..., f_n(p)), where each f_i is a function of the coordinates of p. Then, the query returns the general (i.e., static) skyline of the new space (the corresponding points in the original space).

Fig. 6.2. Illustration of the BBS Algorithm [figure: (a) planar representation of points p1-p8, query point q and the MBRs N1-N6; (b) the corresponding R-tree with leaf nodes N1-N4 and intermediate nodes N5 and N6]

Figure 6.2 shows the R-tree for the dataset of Figure 6.1, together with the minimum bounding rectangles (MBRs) of the nodes. In order to process the dynamic skyline query according to the query point q (shown in Figure 6.2), BBS processes the (leaf/intermediate) entries in ascending order of their mindist to the query reference point (computed on the fly in the dynamic space when the entry is considered for the first time). At the beginning, the root entries are inserted into a heap H (= {N6, N5}) using their mindist as the sorting key. Then, the algorithm removes the top N6 of H, accesses its child node, and enheaps all the entries there (H now becomes {N5, N2, N1}). Similarly, the next node visited is N5, after which H = {N4, N2, N1, N3}. The next node visited is leaf N4, whose data points are also added to H (= {p8, N2, N1, N3, p7}). Since p8 tops H, it is taken as the first dynamic skyline point and used for pruning in the subsequent execution. The algorithm proceeds in the same manner until the heap becomes empty.
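The traversal just described can be sketched as a best-first loop. This is a hedged sketch rather than the original implementation; the entry objects, mindist and the dynamic dominance test dominates_dyn are assumed abstractions over the R-tree:

```python
import heapq
import itertools

def bbs_dynamic_skyline(root, q, mindist, dominates_dyn):
    """Best-first BBS-style traversal; entries are popped in mindist order."""
    tie = itertools.count()      # tie-breaker so the heap never compares entries
    heap = [(mindist(e, q), next(tie), e) for e in root.children]
    heapq.heapify(heap)
    skyline = []
    while heap:
        _, _, e = heapq.heappop(heap)
        if any(dominates_dyn(s, e, q) for s in skyline):
            continue             # dominated point or node: prune it
        if e.is_leaf_entry:
            skyline.append(e)    # reported progressively, in mindist order
        else:
            for child in e.children:
                heapq.heappush(heap, (mindist(child, q), next(tie), child))
    return skyline
```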

In analogy to skyline query processing for only one query reference point (absolute or relative), some recent studies consider several query points at the same time. Given a set of data points P and a set of query points Q, a Spatial Skyline Query (SSQ) [112] retrieves those points of P which are not dominated by any other point in P considering their derived spatial attributes. The main difference from the regular skyline query is that this spatial domination depends on the location of the query points Q. The spatial skyline query can be defined as a special case of the dynamic skyline query [101]. The authors in [112] proposed two algorithms for the spatial skyline query, the R-tree-based B2S2 and the Voronoi-based VS2. The spatial skyline query is also known as the Multi-Source Skyline Query (MSSQ) [43]. The authors in [43] extend multi-source skyline queries to road networks, where the network distance between two locations needs to be computed on the fly. Several algorithms for processing MSSQs are proposed and evaluated in that work.

Li et al. in [92] define a novel data cube called Data Cube for Dominant Relationship Analysis (DADA), which captures the dominant relationships between products and customers, to help firms delineate market opportunities based on customer preferences and competitive products. The concept of dominance is extended for business analysis from a microeconomic perspective. More specifically, a new form of analysis is proposed, called Dominant Relationship Analysis (DRA), which aims to provide insight into the dominant relationships between products and potential buyers. Three new classes of skyline queries, called Dominant Relationship Queries (DRQs), are consequently proposed for analysis purposes. These types of queries are: (1) Linear Optimization Queries (LOQ), (2) Subspace Analysis Queries (SAQ), and (3) Comparative Dominant Queries (CDQ). Efficient computation for such queries is achieved through a novel data structure.

6.2.2 Reverse NN Search

Given a point q, a reverse nearest neighbor (RNN) query retrieves all the data points that have q as one of their nearest neighbors. Besides this monochromatic version, there exists also the bichromatic RNN [115] where, given a set Q of queries, the goal is to find the objects p ∈ P that are closer to some q ∈ Q than any other point of Q. The algorithms for RNN processing can be classified in two categories depending on whether they require preprocessing or not: the hypersphere approaches and the Voronoi approaches.

The original RNN method [86] pre-computes for each data point p its nearest neighbor NN(p). Then, it represents p as a vicinity circle (p, dist(p, NN(p))) centered at p with radius equal to the Euclidean distance between p and its NN. The MBRs of all circles are indexed by an R-tree, called the RNN-tree. Using the RNN-tree, the reverse nearest neighbors of q can be efficiently retrieved by a point location query, which returns all circles that contain q. Because the RNN-tree is optimized for RNN, but not NN search, the authors in [86] propose to use two trees: (1) a traditional R-tree-like structure for nearest neighbor search (called NN-tree) and (2) the RNN-tree for reverse nearest neighbor search. In order to avoid the maintenance of two separate structures, Yang and Lin [129] combine the two indexes in the RdNN-tree. Similar to the RNN-tree, a leaf node of the RdNN-tree contains vicinity circles of data points. On the other hand, an intermediate node contains the MBR of the underlying points (not their vicinity circles), together with the maximum distance from every point in the sub-tree to its nearest neighbor. These techniques extend a multidimensional index structure to store each object along with its nearest-neighbor distance and, thus, actually store hyperspheres rather than points.
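As an illustration of the vicinity-circle idea, the following sketch answers a monochromatic RNN query by a linear scan over pre-computed nearest-neighbor distances; the RNN-tree replaces this scan by a point-location query over the indexed circles. The names are hypothetical.

    def reverse_nearest_neighbors(q, points, nn_dist):
        # p is an RNN of q iff q lies inside p's vicinity circle,
        # i.e. dist(p, q) <= dist(p, NN(p)); nn_dist holds the pre-computed radii
        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
        return [p for p in points if dist(p, q) <= nn_dist[p]]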

Stanoi et al. [115] eliminate the need for pre-computing all NNs by utilizing geometric properties of RNN retrieval. Tao et al. [118] propose a method which utilizes a conventional data-partitioning index (e.g. an R-tree) on the dataset and does not require any pre-computation. Their method is designed for exact processing of RkNN queries with arbitrary values of k on dynamic multidimensional datasets. Their framework follows a filter-and-refinement strategy. Specifically, the filter step retrieves a set of candidate results that is guaranteed to include all the actual reverse nearest neighbors; the subsequent refinement step eliminates the false hits. The two steps are integrated in a seamless way that eliminates multiple accesses to the same index node (i.e., each node is visited at most once). Basically, this method stores the objects in conventional multidimensional index structures without any extension and computes a Voronoi cell during query processing.

Achtert et al. [1] propose the first approach for efficient RkNN search in arbitrary metric spaces where the value of k is specified at query time. Their approach uses the advantages of existing metric index structures as well as conservative and progressive distance approximations to filter out true drops and true hits. In particular, they approximate the k-nearest-neighbor distance for each data object by upper and lower bounds using two functions (of two parameters each). Their solution is based on the hypersphere approach for the RkNN problem with an arbitrary k not exceeding a given threshold parameter kmax for general metric objects.


6.3 Problem Definition

Let D = (D1, . . . , Dd) be a d-dimensional data space and P ⊆ D be a data set. A point p ∈ P can be represented as p = (p1, p2, . . . , pd) with pi ∈ Di, i ∈ {1, . . . , d}. A point p ∈ P is said to dominate another point q ∈ P, denoted as p ≺ q, if (1) for every i ∈ {1, . . . , d}: pi ≤ qi; and (2) for at least one j ∈ {1, . . . , d}: pj < qj. The skyline of P is the set of points SL ⊆ P which are not dominated by any other point, that is, SL = {p ∈ P | ∄q ∈ P : q ≺ p}. The points in SL are called skyline points of P. Figure 6.3 illustrates a database of eight objects P = {p1, p2, ..., p8}, each representing a car with two attributes Price and Mileage. Figure 6.3(a) shows the corresponding points in the 2-dimensional space, where the x and y axes correspond to the attributes Mileage and Price, respectively. In Figure 6.3(a), the point p1 dominates the point p2. Overall, the skyline is the set SL = {p5, p1, p3}.

In the remainder of this section, we first define the dynamic skyline query using the above notation. Then we introduce our reverse skyline query based on the definition of the dynamic skyline query.

[Figure: (a) Points in a 2-d Space: the points p1-p8 plotted with Mileage on the x-axis (ticks 20-100) and Price on the y-axis (ticks 5-25).]

(b) Database Table:

ID    Price    Mileage
p1     5,000    30K
p2     7,500    42K
p3     2,500    70K
p4     7,500    90K
p5    24,000    20K
p6    20,000    50K
p7    26,000    70K
p8    16,000    80K

Fig. 6.3. A Database Example.

6.3.1 Dynamic Skyline Query

The general dynamic skyline specifies a new d′-dimensional space based on the original d-dimensional data space. First, each point p ∈ P is mapped to a point p′ = (f1(p), . . . , fd′(p)) where each fi is a one-dimensional function. Then, the dynamic skyline of P with respect to the functions f1, . . . , fd′ returns the ordinary skyline of the transformed d′-dimensional space derived from the data set P′. For the sake of simplicity, we assume in the following that d′ = d and that, for a given query point q, fi(p) = |qi − pi|, i.e., fi simply refers to the absolute distance to the query point q in the i-th dimension. Note that the following results still hold for a more general class of distance functions. As an example, let us mention that the left and right part of the absolute distance function can receive different weights. This is important in various applications. For example, a car with 1 liter higher fuel consumption than specified in the user preference might be less preferable than a car with 1 liter less.
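As a hedged illustration, such an asymmetric variant of the absolute distance could be realized by a per-dimension mapping along the following lines (the weights are hypothetical parameters, not part of the original definition):

    def f_i(p_i, q_i, w_above=2.0, w_below=1.0):
        # penalize exceeding the preference q_i more than undercutting it
        # (w_above, w_below are hypothetical weights chosen for illustration)
        diff = p_i - q_i
        return w_above * diff if diff >= 0 else w_below * (-diff)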

Definition 6.1. (Dynamic Skyline Query) Given a query point q and a data set P, a Dynamic Skyline Query (DSQ) according to q retrieves all data points in P that are not dynamically dominated. A point p1 ∈ P dynamically dominates p2 ∈ P with regard to the query point q if (1) for all i ∈ {1, . . . , d}: |qi − p1,i| ≤ |qi − p2,i|, and (2) for at least one j ∈ {1, . . . , d}: |qj − p1,j| < |qj − p2,j|, where p1,i denotes the i-th coordinate of p1.

By the above definition, it is equivalent to compute the traditional skyline after transforming all points into the new data space where point q is the origin and the absolute distances to q are used as mapping functions. Consider, for instance, Figure 6.4(a). A user specifies a preference q for a car to request the most interesting cars with respect to the absolute distances to q, where the attributes Price and Mileage are taken into account. Each point p = (p1, p2) in the original 2-dimensional space is transformed to a point p′ = (|p1 − q1|, |p2 − q2|) in the 2-dimensional distance space. This figure illustrates that each database point is mapped to the new distance space where q becomes the origin. The dynamic skyline consists of the points p7, p6, p2 and p4. Note that point p8 is not part of the dynamic skyline because it is dynamically dominated by point p2.
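A minimal quadratic-time sketch of this definition (not the index-based BBS computation) first maps every point into the distance space of q and then keeps the non-dominated ones:

    def dynamic_skyline(points, q):
        # map every p to its coordinate-wise absolute distances from q,
        # then return the ordinary skyline of the transformed space
        def dominates(a, b):
            return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))
        t = {p: tuple(abs(qi - pi) for qi, pi in zip(q, p)) for p in points}
        return [p for p in points
                if not any(dominates(t[o], t[p]) for o in points if o != p)]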

[Figure: (a) Dynamic Skyline of q: the points p1-p8 in the x/y plane with q as the origin of the distance space. (b) Reverse Skyline of q: the points p2, p4 and p6 around q.]

Fig. 6.4. Dynamic and Reverse Skyline.

The term original space refers to the original d-dimensional data space, while the transformed space refers to the space obtained from the d mappings f1, f2, . . . , fd.


6.3.2 Reverse Skyline Query

Based on the definition of the dynamic skyline we now formally define the reverse skyline of a point.

Definition 6.2. (Reverse Skyline Query) Let P be a d-dimensional data set. A Reverse Skyline Query (RSQ) according to the query point q retrieves all points p1 ∈ P where q is in the dynamic skyline of p1. Formally, a point p1 ∈ P is a reverse skyline point of q iff ∄p2 ∈ P such that (a) for all i ∈ {1, . . . , d}: |p2,i − p1,i| ≤ |qi − p1,i| and (b) for at least one j ∈ {1, . . . , d}: |p2,j − p1,j| < |qj − p1,j|.

Let us illustrate the above definition by considering our running example. Figure 6.4(b) depicts the RSQ of point q. As illustrated, point p2 is a reverse skyline point of q since, according to the above definition, point q is part of the dynamic skyline of point p2.

The naive brute-force search algorithm for finding the reverse skyline of P given a query point q requires an examination of all points in P. For each point, a dynamic skyline query is performed (e.g. using BBS) to find the points which have q as part of their dynamic skyline. These points constitute the reverse skyline set. A first optimization of this brute-force approach is to stop processing the dynamic skyline of a point as soon as q is identified as a skyline point; in this case, there is no need to compute the entire skyline. Unfortunately, this simple optimization only leads to marginal improvements.
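The brute-force method can be written down directly from Definition 6.2; the following sketch tests, for every candidate p, whether some other point dominates q in the distance space of p, which already shows why the quadratic number of dominance tests per query is prohibitive:

    def reverse_skyline_naive(points, q):
        # p is a reverse skyline point of q iff no other point dominates q
        # in the distance space of p (Definition 6.2)
        result = []
        for p in points:
            dq = tuple(abs(pi - qi) for pi, qi in zip(p, q))
            dominated = any(
                all(abs(pi - oi) <= d for pi, oi, d in zip(p, o, dq)) and
                any(abs(pi - oi) < d for pi, oi, d in zip(p, o, dq))
                for o in points if o != p)
            if not dominated:
                result.append(p)
        return result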

6.4 Branch and Bound Processing of Reverse Skyline Queries

In this section we propose the Branch and Bound Reverse Skyline (BBRS) algorithm for reverse skyline computation, which is an improved customization of the original BBS algorithm [101]. Similar to [101], we assume that the data points are indexed by a data-partitioning access method (e.g., the R*-tree [13]). We start by providing some important lemmas and we continue with a description of our proposed BBRS algorithm. Finally, we provide an analysis of BBRS.

6.4.1 Preliminaries

In this section we exploit the geometric properties of reverse skyline points with respect to an arbitrary query point. We assume that the data set P and the query point q are given. We first define the global skyline in order to reduce the search space in finding reverse skyline points. For this, we prove one lemma (Lemma 6.4) that helps to immediately identify candidate reverse skyline points. In addition, we provide another lemma (Lemma 6.5) to eliminate some of these candidate reverse skyline points not contributing to the result set.


Definition 6.3. (Global Skyline) A point p1 ∈ P globally dominates p2 ∈ P with regard to the query point q if (1) for all i ∈ {1, . . . , d}: (p1,i − qi)(p2,i − qi) > 0, (2) for all i ∈ {1, . . . , d}: |p1,i − qi| ≤ |p2,i − qi| and (3) for at least one j ∈ {1, . . . , d}: |p1,j − qj| < |p2,j − qj|. The Global Skyline of a point q, GSL(q), contains those points which are not globally dominated by another point according to q.

Figure 6.5 shows an example of the global skyline of point q and the corresponding dominance regions. Note that the global skyline is different from the dynamic skyline as there is no space transformation. We now present an important lemma which proves that the global skyline set is sufficient to answer the reverse skyline of q correctly.

Lemma 6.4. Let q be the query point, GSL(q) be the set of global skyline points and RSL(q) the set of reverse skyline points. Then, RSL(q) ⊆ GSL(q).

Proof. Let x /∈ GSL(q). Then, there is a point y ∈ GSL(q) that globally dominates x. Since for all i ∈ {1, . . . , d}: |qi − xi| ≥ |qi − yi| and there exists a j ∈ {1, . . . , d}: |qj − xj| > |qj − yj|, it follows from Definition 6.2 that x /∈ RSL(q).

Lemma 6.4 enables our RSQ algorithms to efficiently retrieve a subset of the data points in P which are potential reverse skyline points of q by simply examining the global skyline of q. In Section 6.4.2, we utilize Lemma 6.4 to find the candidate reverse skyline set. The following lemma helps us to eliminate some of these candidate reverse skyline points not contributing to the result set.
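Definition 6.3 translates directly into code; this sketch computes GSL(q) by a quadratic scan, whereas the BBRS algorithm below obtains the same set with a best-first index traversal:

    def globally_dominates(p1, p2, q):
        # Definition 6.3: same orthant w.r.t. q, closer in every dimension,
        # strictly closer in at least one
        return (all((a - c) * (b - c) > 0 for a, b, c in zip(p1, p2, q)) and
                all(abs(a - c) <= abs(b - c) for a, b, c in zip(p1, p2, q)) and
                any(abs(a - c) < abs(b - c) for a, b, c in zip(p1, p2, q)))

    def global_skyline(points, q):
        return [p for p in points
                if not any(globally_dominates(o, p, q) for o in points if o != p)]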

[Figure: the data points p1-p8 and the query point q in the x/y plane, together with the dominance regions that define the global skyline of q.]

Fig. 6.5. Global Skyline Example.

Lemma 6.5. Let GSL(q) be the global skyline of point q and let s ∈ GSL(q) be a global skyline point. If ∃p ∈ P such that for all i ∈ {1, . . . , d}: |pi − si| < |qi − si|, then point s is not a reverse skyline point of q.


Proof. Assume that the window query returns p as an answer. This means that p dynamically dominates q with regard to s. Therefore, s is not in RSL(q).

To illustrate, consider the rectangle centered at point p2 in Figure 6.5. The extent of the rectangle is defined by the coordinate-wise distances to q. If there is no point inside this rectangle, then p2 is in RSL(q). Otherwise there is a point in P which dominates q relative to its distance to p2 and, therefore, p2 is not in RSL(q). With the result of Lemma 6.5, we reduce the time complexity of our algorithms by disregarding the dominance tests against the global skyline set.

6.4.2 Description of BBRS

The Branch and Bound Reverse Skyline (BBRS) algorithm computes the reverse skyline of a point q by expanding the entries of a heap H according to their distance from q. The algorithm works as follows: First, the algorithm computes (using Lemma 6.4) the set GSL(q) of candidate reverse skyline points, which is an upper bound of the actual reverse skyline set (i.e. no false misses are produced). These candidates are subsequently refined to the actual result. For this we run a window query for each point s of the global skyline, and if the query returns no answer we are sure (based on Lemma 6.5) that point s is a reverse skyline point of q.

For the following discussion, we use the set of 2-dimensional data points organized in the R-tree of Figure 6.2. BBRS starts from the root node of the R-tree and inserts all its entries (N5, N6) in a heap sorted by their distance from q. Then, the entry with the minimum distance (N5) is expanded. This expansion removes the entry (N5) from the heap and inserts its children (N1, N2). The next expanded entry is again the one with the minimum distance from q (N1), in which the first global skyline point (p2) is found. This point (p2) belongs to the reverse skyline, as the window centered at p2 (Figure 6.5) is empty, and it is used for pruning in the subsequent execution. Point p2 is inserted into the list RSL of reverse skyline points. Note that point p1 is globally dominated by p2, and as soon as p1 reaches the top of the heap it is immediately discarded without the window test. The next entry to be expanded is N6: BBRS proceeds with the node N6 and inserts its children (N4, N3). The heap now becomes (N4, N2, p1, N3). The algorithm proceeds in the same manner until the heap becomes empty; thus all reverse skyline points are inserted in the result set RSL. The pseudo-code for BBRS is shown in Algorithm 13.

Two important issues need to be addressed. Firstly, the window query can be implemented as an empty range query. An empty range query, also known as a boolean range query [113], is a special case of range query which returns either true or false depending on whether there is any point inside the given range or not. Obviously, empty range queries can be handled more efficiently than equivalent standard range queries, and this advantage is exploited in the proposed schemes.


Algorithm 13 BBRS (R-tree R, Query point q)
1: RSL ← {} // set of reverse skyline points
2: insert all entries of the root of R in the heap H sorted by distance from q;
3: while (heap H is not empty) do
4:    remove top entry e;
5:    if (e is globally dominated by some point in S) then
6:       discard e;
7:    else if (e is an intermediate entry) then
8:       for (each child ei of e) do
9:          if (ei is not globally dominated by some point in S) then
10:            insert ei into heap H;
11:         end if
12:      end for
13:   else
14:      insert the pruning area of e into S;
15:      execute the window query based on e and q;
16:      if (the window query is empty) then
17:         add e to RSL;
18:      end if
19:   end if
20: end while
21: output RSL

For each candidate point, we define a rectangular range with the candidate point as the center and the coordinate-wise distance to the query point as its extent. If this boolean range query returns false, then the point belongs to the reverse skyline; otherwise, the point is not a reverse skyline point. The main strength of an empty range query over traditional range queries is that even if multiple MBRs intersect the search region, we do not need to access all of them. This is because, for example, if at least one edge of an MBR is already inside the search region, then it follows that there is at least one point qualifying the query. Secondly, the dominance test can be expensive if the skyline contains numerous points. In order to speed up this task we insert the skyline points found into a main-memory R-tree. Notice that an entry is tested for dominance twice: before it is inserted in the heap and before it is expanded. The second test is necessary because an entry in the heap may become dominated by some skyline point discovered after its insertion (in which case it does not need to be visited).
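The following sketch illustrates the boolean range query on a toy in-memory tree (hypothetical structures): it stops at the first proof of non-emptiness and uses full MBR containment as a shortcut, a conservative variant of the edge argument above.

    def window_nonempty(entry, low, high):
        # entry: a point tuple, or a node (mbr_low, mbr_high, children);
        # does any point lie in the open box (low, high)?
        if isinstance(entry, tuple) and not isinstance(entry[0], tuple):
            # data point: direct containment test
            return all(l < x < h for x, l, h in zip(entry, low, high))
        mbr_low, mbr_high, children = entry
        # prune: MBR disjoint from the box
        if any(mh <= l or ml >= h
               for ml, mh, l, h in zip(mbr_low, mbr_high, low, high)):
            return False
        # shortcut: a (non-empty) node whose MBR lies entirely inside the box
        # proves a hit without descending further
        if all(l < ml and mh < h
               for ml, mh, l, h in zip(mbr_low, mbr_high, low, high)):
            return True
        return any(window_nonempty(c, low, high) for c in children)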

6.4.3 Analysis of BBRS

In this subsection we provide an analysis of BBRS. At first we present some important lemmas which guarantee the correctness of our algorithm. After that, we prove the efficiency of the branch and bound algorithm in terms of node accesses.

Lemma 6.6. BBRS visits (leaf and intermediate) entries of the R-tree in ascending order of their distance to the query point q.

Proof. The proof is straightforward since the algorithm always visits entries according to their mindist order preserved by the heap.

Lemma 6.7. Any data point added to RSL during the execution of the algorithm is guaranteed to be a final reverse skyline point.

Proof. This is guaranteed by Lemma 6.5.

Lemma 6.8. Every data point will be examined, unless one of its ancestor nodes has been pruned.

Proof. The proof is obvious since all entries that are not pruned by an existing global skyline point are inserted into the heap and examined.

Lemmas 6.6 and 6.7 guarantee that, if BBRS is allowed to execute until its termination, it correctly returns all reverse skyline points, without reporting any false hits. Now, we prove that BBRS retrieves as candidates only the nodes that may contain reverse skyline points, and does not access the same node twice. Central to the analysis of BBRS is the concept of the global skyline search region (GSSR) of point q, that is, the part of the data space that is not globally dominated by any skyline point.

Lemma 6.9. If an entry e does not intersect the GSSR, then there is a skyline point p whose distance from the query point q is smaller than the mindist of e.

Proof. Since e does not intersect the GSSR, it must be dominated by at least one skyline point p, meaning that p dominates e. This implies that the distance of p to the query point q is smaller than the mindist of e.

Lemma 6.10. The number of candidates examined by BBRS is minimum.

Proof. Assume, to the contrary, that the algorithm also visits an entry e that does not intersect the GSSR. Clearly, this entry should not be accessed because it cannot contain reverse skyline points. Consider a global skyline point k that dominates e. Then the distance of k to q is smaller than the mindist of e. According to Lemma 6.6, BBRS visits the entries of the R-tree in ascending order of their mindist to q. Hence, k must be processed before e, meaning that e will be pruned by k, which contradicts the fact that e is visited. In order to complete the proof, we need to show that an entry is not visited multiple times. This is straightforward because entries are inserted into the heap (and expanded) at most once, according to their mindist from q.

Page 162: Advanced Indexing and Query Processing for Multidimensional Databases

148 6 Efficient Computation of Reverse Skyline Queries

6.5 Reverse Skyline Computation using Skyline Approximations

In this section we propose an enhanced algorithm, called the Reverse Skyline using Skyline Approximations (RSSA) algorithm, for supporting reverse skyline queries, which is based on the well-known filter-refinement paradigm. The main idea is to compute the dynamic skyline for each database object and to keep a fixed-size approximation of this skyline on disk. By using this approximation in the filter step, we are able to identify points being in the reverse skyline as well as to filter out points not being in the reverse skyline. The remaining candidate points are then further examined in the refinement step. Similar to the method presented in the previous section, a window query is issued for each candidate, but the size of the window can be substantially reduced due to the approximation. Consequently, this also leads to substantial cost savings.

6.5.1 Basic Observations

In this subsection we present an important lemma which allows our RSSA algorithm to compute the reverse skyline for every point by eliminating unnecessary dominance tests (i.e. window queries) for (1) definite reverse skyline points and (2) points not belonging to the result.

Lemma 6.11. For a given point p, let DSL(p) be the set of dynamic skyline points of p, and let q be a query point. There is an s ∈ DSL(p) that dynamically dominates the point q if and only if point p is not a reverse skyline point of q.

A straightforward corollary of the above lemma is the following: if the query point q dominates a point s from DSL(p), then we can conclude that point p is a reverse skyline point of q. The intuition behind the above lemma is that whenever we have to test whether a point p is a reverse skyline point of q or not, it is sufficient to examine to which of the two regions defined by the skyline of p the point q belongs. The Dynamic Dominance Region of p, DDR(p), contains the points dominated by at least one skyline point, whereas the Dynamic Anti-Dominance Region, DADR(p), contains the points dominating some skyline point. If the query point q falls inside DADR(p), then, based on the above observation, we conclude that point p is a reverse skyline point of q. On the contrary, if the query point q is in DDR(p), we can discard point p because it is not in the reverse skyline of q. Figure 6.6(a) illustrates these two cases, where the skyline of p contains seven points. In this figure, point q1 is inside the DADR of p and for that reason point p is a reverse skyline point of q1. This is because point q1 dominates some point of the dynamic skyline of p (in particular point s1); after the insertion of the point q1, the modified skyline (shown with dashed lines in Figure 6.6(a)) would also contain the point q1. Quite on the contrary, in the same figure point q2 is dominated by some point of the dynamic skyline of p (i.e. it falls inside DDR(p) and is dominated by s1) and for that reason point p is not a reverse skyline point of q2.

Based on this observation we can design a first algorithm for the reverse skyline computation of an arbitrary query point q. In a pre-processing step, we first compute the dynamic skyline for every point and store these skylines on disk. When a query q is issued, we compute its global skyline and check for each point p in the global skyline whether q is in DADR(p) or in DDR(p). The problem of this algorithm is, however, its huge storage overhead. For independent dimensions the expected number of skyline points is Θ((log N)^(d−1)/(d − 1)!) [16]; therefore, the total storage cost of keeping all dynamic skylines is super-linear in N, the number of objects.

[Figure: (a) Lemma 6.11: the dynamic skyline of p with skyline point s1, the regions DDR and DADR, and the query points q1 (in DADR) and q2 (in DDR). (b) DDR and DADR of p: the sample points s1, s2, s3 and the regions they define.]

Fig. 6.6. Illustration of DDR(p) and DADR(p).

In order to overcome this problem, we propose to keep fixed-size progressive approximations of DADR and DDR for each database point rather than keeping the exact regions. The basic idea of our approximation scheme is to select a sample of k points (k ≤ kmax, where kmax denotes the number of actual dynamic skyline points) from the dynamic skyline of point p. Hence, the storage overhead of our approach is linear in the number of objects (as k is considered to be a constant). The union of the dominance regions of these samples is then a progressive approximation of DDR(p), while the union of their anti-dominance regions is a progressive approximation of DADR(p). Consider Figure 6.6(b), where seven skyline points (black and white) are shown. The three black ones, points s1, s2 and s3, are selected as samples for the approximation. Consequently, the approximated regions DADR(p) and DDR(p) are the unions of the anti-dominance regions and the dominance regions of these three black points, respectively. The part of the dynamic space of a point p that remains after removing the approximated DADR(p) and DDR(p) constitutes the approximated skyline. Figure 6.7 shows an example of the approximated skyline of p. Methods to compute such approximated skylines, given a value k of skyline points (k ≤ kmax), are discussed in Section 6.6.

6.5.2 Description of the Algorithm

The enhanced RSSA algorithm for reverse skyline query processing on our R-tree-based framework is explained next. An R-tree is built on the dataset and a skyline approximation of every database point is stored on disk. The initialization step remains the same: a global skyline query GSL(q) is performed, returning the data objects belonging to the global skyline of the query, and in the next step these objects are efficiently analyzed to answer the query. Note that the query point q is transformed to the new space (according to the candidate) in order to examine to which region of the approximated skyline it belongs. The remainder of this subsection explains these two steps in more detail, together with necessary optimizations.

Filter Step

While calculating the global skyline of a query point, we need to test the candidate reverse skyline points. However, some of them can be ruled out from further elaboration by examining their skyline approximations. The principle of the elimination is based on Lemma 6.11: a point from the approximated skyline can dynamically dominate the query point (i.e. the query point q is in DDR(p)), in which case this candidate can never be in the reverse skyline of the query point in the whole data set. In addition, definite reverse skyline points can immediately be reported by taking advantage of DADR(p). In summary, within the filter step we distinguish between the following three cases:

1. A point p can be dropped if q ∈ DDR(p).
2. Otherwise, if q ∈ DADR(p), point p can be added to the result set and is a reverse skyline point.
3. In case that q /∈ DADR(p) ∪ DDR(p), point p needs to be refined.

This means that, as soon as we find a candidate reverse skyline point s, we read the approximated skyline of s from disk and check if point q falls inside the dominance region of some approximate skyline point. If this is the case, we can safely drop s. We refer to this step as the filter step. Our experiments show that this step can efficiently filter out a significant number of candidates. After the filter step, we check if the query point falls inside the anti-dominance region of some approximated skyline point, which means that point s is a reverse skyline point of q.
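In code, the filter step amounts to a pair of dominance tests against the k stored samples; a minimal sketch, assuming the samples are kept in p's distance space:

    def filter_step(p, q, samples):
        # classify candidate p against its k sampled dynamic-skyline points:
        # 'drop' (q in approximated DDR), 'hit' (q in approximated DADR),
        # or 'refine' (window query needed)
        dq = tuple(abs(qi - pi) for qi, pi in zip(q, p))
        def dom(a, b):
            return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))
        if any(dom(s, dq) for s in samples):   # some sample dominates q
            return 'drop'
        if any(dom(dq, s) for s in samples):   # q dominates some sample
            return 'hit'
        return 'refine'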


[Figure: (a) Filter Step: the sample points s1, s2, s3 define the approximated skyline of p; query point q1 lies in the approximated DADR and q2 in the approximated DDR. (b) Refinement Step: query point q lies in neither region, and the reduced dynamic window has to be refined.]

Fig. 6.7. Approximated Skyline of p.

Refinement Step

When none of the above cases matches, we start the refinement step by issuing an empty range check query. The query region is now the dynamic query region, defined by the query point q and the origin of the dynamic space of p, minus DADR(p). Note that the size of this window is much smaller than the original window size, and consequently the query can be performed much faster.

Actually, because of the back-transformation to the original space, where point p is the center of the original window, the small window is mapped to 2^d sub-partitions defined by point p. However, in order to further optimize the refinement step, we issue a 2-step empty range query. The first query is the original window as defined above, while the second corresponds to the 2^d projected smaller windows. If the first range query is empty, we do not need to access any nodes. In the latter case, we only test with the smaller windows. Figure 6.8 illustrates the refinement step in the original space.
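The geometry of the back-transformation can be sketched as follows: the dynamic window around p maps to one sub-window per orthant of p, 2^d in total. The sketch ignores the additional reduction by DADR(p) described above and is purely illustrative:

    from itertools import product

    def refinement_windows(p, q):
        # one sub-window per orthant of p; their union is the axis-aligned box
        # [p - |q - p|, p + |q - p|] around p in the original space
        ext = [abs(qi - pi) for pi, qi in zip(p, q)]
        windows = []
        for signs in product((-1, 1), repeat=len(p)):
            lo = tuple(pi - e if s < 0 else pi for pi, e, s in zip(p, ext, signs))
            hi = tuple(pi if s < 0 else pi + e for pi, e, s in zip(p, ext, signs))
            windows.append((lo, hi))
        return windows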

[Figure: the points p and q and the samples s1, s2, s3 in the original space, together with the original window centered at p and the smaller refinement windows.]

Fig. 6.8. Refinement in the Original Space.


We describe the RSSA algorithm using the example shown in Figure 6.7, where the points q1, q2 and q correspond to our query points. A global skyline query GSL(q) is performed, returning the data objects belonging to the global skyline of the query, and in the next step these objects are efficiently analyzed to answer the query. Assume now that point p is returned by the global skyline query as a candidate reverse skyline point. Considering Figure 6.7(a), because query point q1 is in DADR(p), the filter step is sufficient to identify p as being in the reverse skyline. On the other hand, query point q2 is in DDR(p) and therefore our filter step ensures that point p can be discarded and is not a reverse skyline point. Now consider Figure 6.7(b), where point q is neither in DDR(p) nor in DADR(p); therefore, a refinement step is necessary by issuing a 2-step empty range query. Note that the lower left corner of the dynamic window is the origin of the dynamic space. As we can now benefit from DADR(p), the window only corresponds to the region with the grid pattern. Though the refinement step is required in this case, the much smaller window leads to substantial performance savings. The sketch depicted in Figure 6.9 summarizes our general approach for answering reverse skyline queries.

[Figure: flow of the multi-step algorithm. For a query object q, a global skyline query on the index (e.g. R-tree) finds the candidates for RSL(q); the filter step uses the stored skyline approximations to efficiently eliminate candidates via DDR and to report true hits via DADR; the refinement step runs window queries on the remaining candidates to determine the final RSL.]

Fig. 6.9. The Multi-Step Algorithm.

6.5.3 Updates

In the following, we present different strategies for updating our pre-processed approximations when points are inserted into and removed from the database. It should be noted that our approach is primarily designed for query-intensive environments with a modest rate of updates, and that the dynamic maintenance of our approximation scheme is rather expensive, as the computation of skylines is unavoidable.

Let us first discuss the case of inserting a new point x into the database. In addition to inserting x into the R-tree, the approximations of the affected points have to be updated. Therefore, the global skyline GSL(x) is computed, and the approximations DADR(y) and DDR(y) of every point y of GSL(x) have to be checked for updates. There are two cases: First, x is in DDR(y). Then, x has no impact on the skyline and its approximation. Second, x is not in DDR(y), i.e., x may affect the skyline. Then, the dynamic skyline of y and its corresponding approximation are recomputed. As the latter case might be quite expensive, we can trade cost for approximation quality in the following way. When x is not in DADR(y), we simply do not recompute the approximation. Moreover, when x is in DADR(y), i.e., x dominates at least one of the sample points, we remove all the dominated sample points from the approximation and insert x into the sample. Again, there is no need for recomputing the approximation.
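The cheap maintenance variant for insertions can be sketched as follows, assuming the samples are kept in y's distance space:

    def maintain_approx_on_insert(x, y, samples):
        # cheap update of y's sampled skyline approximation after inserting x
        # (the trade-off variant described above; exact recomputation skipped)
        dx = tuple(abs(xi - yi) for xi, yi in zip(x, y))
        def dom(a, b):
            return all(u <= v for u, v in zip(a, b)) and any(u < v for u, v in zip(a, b))
        if any(dom(s, dx) for s in samples):
            return samples                      # x in approximated DDR(y): no change
        if any(dom(dx, s) for s in samples):    # x in approximated DADR(y):
            return [s for s in samples if not dom(dx, s)] + [dx]
        return samples                          # neither region: skip recomputation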

Now let us consider the deletion of a point x from the database. First, x is deleted from the R-tree; then, the global skyline GSL(x) is computed and the approximations of every point y of GSL(x) have to be checked for updates. There are again two cases. Firstly, x is in DDR(y), i.e., x is not a point of the dynamic skyline of y. Consequently, there is no impact on the approximation. Secondly, x is not in DDR(y). Then, x could be a member of the dynamic skyline of y and therefore the dynamic skyline is computed. If x was indeed a skyline point, the approximation of the skyline is recalculated. A much less expensive alternative for the second case would be to avoid the recalculation altogether: instead, we only test whether x is in the approximation and, if so, delete it from the approximation.

6.6 Approximation of Skylines

The basic idea of our approximation scheme is to pre-compute the dynamic skyline for each object of the database and to select a fixed number k of skyline points. Hence, the storage overhead of our approach is linear in the number of objects (as k is considered to be a constant). This section deals with the problem of computing good progressive approximations DADR(p) and DDR(p). The goal is to maximize the probability that the refinement step is not required. This translates into maximizing the volume of DADR and DDR (the volume is defined as the number of points in the regions).

Before presenting non-trivial methods for calculating such approximations, let us first mention two naive approaches. The first naive approach is simply to compute a random sample of size k from the skyline points. The second naive approach is to sort the points first according to a specific dimension; then, every (m/k)-th point is drawn from the sorted sequence, where m denotes the number of elements. Both methods require that the entire skyline is available. In order to reduce the overhead of the pre-processing step, yet another approach would be to stop computing skylines after having received the first k points. For BBS, this leads to poor approximations and therefore we did not consider it in the following discussions.

6.6.1 Optimal Approximation for 2-dimensional Skylines

An optimal algorithm for the approximation can be provided for two-dimensional data. The basic idea of the optimal algorithm is similar to the ones for approximating histograms [76] and time series [64]. The problem can be solved by a dynamic programming algorithm returning the optimal approximation. The reader is referred to [76], where the idea is nicely presented for approximating histograms.

Suppose that S = {(x1, y1), . . . , (xm, ym)} is a collection of m skyline points in a two-dimensional space. We sort the skyline points in ascending order of their y-coordinate values; consequently, they are also sorted in descending order of their x-coordinate values. Furthermore, we treat the extreme points (x0, y0) = (0, 1) and (xm+1, ym+1) = (1, 0) separately.

For simplicity, we consider only the Dynamic Dominance Region (DDR) in the following discussion. For 0 ≤ i ≤ j ≤ m, the error metric we consider in this chapter is:

S([i, j]) = ( \sum_{k=i}^{j} x_k (y_{k+1} − y_k) ) − x_i (y_{j+1} − y_i)

This function considers the case in which we first choose the i-th skyline point and thereafter the (j+1)-th point. It is important that the error metric used in our algorithm is monotone, as expressed in the following lemma.

Lemma 6.12. For any skyline S and any values of i, j, k with 0 ≤ i ≤ k < j ≤ m the following holds:

S([i, j]) ≤ S([i, k]) + S([k, j]).

Similar to [76], we calculate the error function SSE*(i, k) using dynamic programming. This corresponds to the best approximation with k points of the first i skyline points, i ≥ k. The optimal algorithm for approximating skylines is outlined in Algorithm 14.

The optimal algorithm runs in O(m^2 · kmax) iterations and has a space complexity of O(m · kmax), where m is the size of the skyline and kmax denotes the sample size.

6.6.2 A Greedy Algorithm for d-dimensional Skylines

While we can achieve an optimal approximation for 2-dimensional skylines, the problem of an optimal approximation for d-dimensional skylines turns out


Algorithm 14 OptimalSelect()
1: Input: Skyline S, sample size kmax
2: Output: Sample Sam
3: if (|S| ≤ kmax) then
4:    return S;
5: end if
6: for (k = 1; k ≤ kmax; k++) do
7:    for (i = 1; i ≤ m; i++) do
8:       SSE*(i, k) = min_{1<j<i} ( SSE*(j, k−1) + S([j+1, i]) );
9:    end for
10: end for
11: Sam = {(xj, yj) | index j was selected in SSE*(m, kmax)};
12: return Sam

to be more complex for d > 2. As the monotonicity property (see Lemma 6.12) no longer holds, dynamic programming does not guarantee the delivery of an optimal approximation. A naive algorithm is to test all possible subsets of size k, which would result in a runtime of O(m^k). The interested reader is referred to [97], where efficient heuristics are discussed for obtaining approximate solutions. In this chapter, we limit our discussion to a greedy-based algorithm with O(m · kmax) iterations. The basic idea is to select, among the points that have not been examined yet, the point which adds the highest volume increase. The algorithm is outlined in Algorithm 15.

Algorithm 15 GreedySelect()
1: Input: Skyline S, sample size k
2: Output: Sample Sam
3: LSB = {}; USB = {}; Sam = {};
4: if (|S| ≤ k) then
5:    return S;
6: end if
7: for (i = 0; i < k; i++) do
8:    x = argmax_{p∈S} VOL(ADR(p) ∪ LSB) + VOL(DR(p) ∪ USB);
9:    insert(Sam, x);
10:   remove(S, x);
11:   LSB = LSB ∪ ADR(x);
12:   USB = USB ∪ DR(x);
13: end for
14: return Sam


6.7 Experimental Evaluation

In this section, we experimentally evaluate the efficiency of the proposed techniques for reverse skyline computation. We used two real datasets, namely CarDB and NBA, in our experiments (available from autos.yahoo.com and www.nba.com, respectively). Specifically, the used-car database CarDB is a 6-dimensional dataset with attributes referring to Make, Model, Year, Price, Mileage and Location. This dataset contains 50,000 tuples extracted from Yahoo! Autos. The two numerical attributes Price and Mileage of a car are considered in our experiments. NBA contains 17,000 13-dimensional points. Each record provides the statistics of a player in a season. We selected four attributes: number of games played (GP), total points (PTS), total rebounds (REB) and total assists (AST). Finding the reverse skyline in this player-statistics dataset makes excellent sense in practice: coaches are often interested in finding the best substitutes of a player P, and suitable candidates might be in the reverse skyline of P.

We also used synthetic data sets with two different distributions. Uniformly distributed datasets consist of random points from the unit square, whereas the clustered datasets comprise ten randomly centered clusters, each of them with an equal number of points that follow a Gaussian distribution with variance 0.05 and mean equal to the associated centroid. We generated 2-, 3- and 4-dimensional synthetic datasets (uniform and clustered) of varying sizes, ranging from 20,000 to 80,000 points. The values of the attributes are in the range [0; 10,000].

All experiments have been performed on a Windows PC with a 32-bit 3.2 GHz CPU and 2 GB main memory. In each experiment we performed 100 reverse skyline queries on the particular data set and report the overall result. The queries follow the distribution of the dataset. Each dataset is indexed by an R-tree, where the page size is set to 4 KB in all cases. All evaluated methods have been implemented in Java using the XXL library [42].

6.7.1 Tuning the Skyline Approximation

The first set of experiments examines the impact of the number k of approximate skyline points on the performance of our RSSA algorithm. For every synthetic dataset (Uniform and Clustered) we created five R-trees by varying k from 10 to kmax = 50 skyline points. Then, we used the R-trees to process a query workload and measured the average (per-query) number of page accesses. Figures 6.10(a) and 6.10(b) plot the cost as a function of k, for workloads with d = 3. Note that the result for k = 0 corresponds to the overhead of the basic BBRS, which does not approximate the skyline (Section 6.4). As k becomes larger, the query performance improves continually. This is expected because we are using better approximations of the skylines.

In Figure 6.10(c) we report the results of our approximation algorithms (optimal and greedy) for selecting the skyline points used in the approximation


[Figure: I/Os as a function of the parameter k for (a) Uniform (3-d, 50k), (b) Clustered (3-d, 50k) and (c) the CAR dataset (2-d, 50k), the latter comparing the greedy and the optimal selection.]

Fig. 6.10. Performance for different values of k.

for the 2-dimensional CAR dataset. We created 3 R-trees by varying k from 5 to 15 skyline points; in addition, we created one R-tree to measure the overhead of the basic BBRS (k = 0). Then, we used each tree to process a query workload. We observe that there is only a mild difference between the greedy and the optimal algorithm. In the sequel, we set k to 10 and 30 for CarDB and NBA, respectively, which provides the best overall performance.

6.7.2 Examination of Reverse Skyline Queries

We now examine the performance of our algorithms for reverse skyline computation on all datasets. Since there is no competitive approach for reverse skyline search, we compare only our two algorithms. BBRS does not pre-compute skyline approximations; obviously, this approach has less storage overhead than RSSA but needs expensive refinement steps. The optimized RSSA algorithm stores all pre-computed k skyline points on disk and therefore requires at most one extra page access for each examination.

Pruning Capabilities

In Figures 6.11(a) and 6.11(b), the global vs. the reverse skyline size is plotted as a function of dimensionality for the synthetic datasets. More precisely, we used datasets with 50,000 points whose d varies from 2 to 4. Based on the fact that the global skyline is a superset of the reverse skyline, we observe that our BBRS algorithm effectively prunes the searched space during the reverse skyline computation. To examine the skyline sizes for the real datasets, we plotted the average number of global and reverse skyline points for the CarDB and NBA datasets in Figure 6.11(c).

Figure 6.12(a) shows the pruning capability of RSSA w.r.t. k on the 3-dimensional Clustered data set. Compared with the size of the result set, only a small number of candidates has to be examined, i.e., DDR yields a sound upper bound. Furthermore, the number of true hits we get from our DADR approximation increases with increasing k. For these objects no expensive refinement step is necessary; thus our DADR approximation provides a very


[Figure: average reverse vs. global skyline sizes. (a) Uniform dataset (N=50,000): reverse skyline sizes of 6.8, 52.5 and 234.3 points vs. global skyline sizes of 31.6, 278.2 and 1325.8 points for d = 2, 3, 4. (b) Clustered dataset (N=50,000): 8.75, 35.3 and 107.25 vs. 26.9, 164.15 and 516.55 points. (c) Real datasets: 4.2 vs. 15.65 points for CAR (d=2, N=50k) and 7.05 vs. 228.65 points for NBA (d=4, N=17k).]

Fig. 6.11. Reverse vs. Global Skyline Size.

[Figure: (a) Filter vs. Refinement: the numbers of discarded, true-hit and refined points as a function of the parameter k. (b) Window Radius: the average window radius of RSSA as a function of the parameter k.]

Fig. 6.12. RSSA performance for different values of k.

accurate lower bound for the skyline. Figure 6.12(b) shows the average window radius as a function of k for the same dataset.

Evaluation of the Algorithms

Next, we examine the cost of BBRS and RSSA for answering each query on all datasets. In Figures 6.13(a) and 6.13(b), we depict the results of the comparison for the synthetic and real datasets, respectively. In Figure 6.13(a), the results are presented for reverse skyline queries on the synthetic datasets, whereas Figure 6.13(b) provides the results of the same experiments on CarDB and NBA, except that the y-axis is in logarithmic scale. RSSA consistently achieves lower average cost than BBRS (by one order of magnitude in Figure 6.13(b)). This experiment demonstrates the efficiency of our algorithms on all examined datasets.

6.7.3 Scalability with the Dimensionality and Dataset Cardinality

In this section we experimentally confirm that our algorithms (BBRS and RSSA) scale with the dimensionality and the cardinality of the dataset. In this experiment, we only used synthetic datasets. For each dataset, we chose (for RSSA) the number k of skyline points that optimizes the overall performance (through a tuning process similar to Figure 6.10).


[Figure: average I/Os of RSSA and BBRS for (a) the synthetic datasets (Uniform and Clustered) and (b) the real datasets (CAR and NBA), the latter with a logarithmic y-axis.]

Fig. 6.13. Average cost for our algorithms.

Specifically, we set k to 10 for the dataset containing 20,000 points, and to 30 for the others.

Effect of Dimensionality

For the purpose of this experiment we used both synthetic datasets. In order to examine the impact of the dimensionality d on our algorithms, we used datasets with 50,000 points with d varying from 2 to 4. In Figure 6.14(a), the cost of each method (in retrieving reverse skylines) is plotted as a function of d for uniform distributions. Both algorithms scale well, although our optimized RSSA algorithm outperforms BBRS significantly (up to an order of magnitude for d = 4). In Figure 6.14(b), the x-axis represents the dimensionality whereas the y-axis measures the page accesses required to calculate the reverse skyline for the clustered dataset. Similar observations hold.

[Figure: I/Os of RSSA and BBRS as a function of the dimensionality (2 to 4) for (a) Uniform (N=50k) and (b) Clustered (N=50k).]

Fig. 6.14. Scalability with dimensionality.

Effect of Cardinality

In this experiment we used the 3-dimensional synthetic datasets and varied the cardinality between 20,000 and 80,000 points. Figure 6.15(a) compares the average cost of BBRS and RSSA for uniformly distributed data. Evidently, the optimized method (RSSA) scales better with cardinality than BBRS; in particular, RSSA outperforms BBRS significantly for 80,000 points. Figure 6.15(b) illustrates the corresponding results for Clustered data, confirming similar observations, where k = 10 for the smallest dataset and k = 30 for the others.

[Figure: I/Os of RSSA and BBRS as a function of the cardinality (20,000 to 80,000) for (a) Uniform (d=3) and (b) Clustered (d=3).]

Fig. 6.15. Scalability with cardinality.

6.8 Conclusions

In this chapter, we introduced the concept of Reverse Skyline Queries (RSQ). Given a set of data points P and a query point q, an RSQ returns the data objects that have the query object in the set of their dynamic skyline. It is the complementary problem to that of finding the dynamic skyline of a query object. Such a dynamic skyline corresponds to the skyline of a transformed data space where point q becomes the origin and all points are represented by their distance to q.

In order to compute the reverse skyline of an arbitrary query point, we first proposed a branch-and-bound algorithm (called BBRS), which is an improved customization of the original BBS algorithm. Furthermore, we identified a superset of the reverse skyline that allows us to bound the space searched during the reverse skyline computation. To further reduce the computational cost of determining whether a point belongs to the reverse skyline, we proposed an enhanced algorithm (called RSSA) that is based on accurate pre-computed approximations of the skylines. These approximations are used to identify whether a point belongs to the reverse skyline or not. For two-dimensional data, we presented an optimal algorithm, while for higher dimensions a greedy algorithm is proposed. Through extensive experiments with both real-world and synthetic datasets, we showed that our algorithms can efficiently support reverse skyline queries.


Part IV

Conclusions and Outlook


7

Summary of Contributions

In recent years, advanced database systems have emerged. They are necessary because of the demand for storage, management and retrieval of large amounts of data in application areas such as Content-based Retrieval, Electronic Market Places and Decision Support Applications in general. In contrast to conventional database systems, users of advanced database systems focus on similarity search and multi-criteria optimization. Based on an extensive analysis of objects and properties typical for advanced database systems, we have determined four requirements: (1) the need for efficient index structures to cope with high-dimensional data, (2) the re-usability of existing multidimensional index structures which work well in low-dimensional spaces, (3) novel (more expressive) query operators for advanced database systems and (4) efficient analysis of complex high-dimensional data; all of which were so far insufficiently considered in similarity search and multi-criteria optimization. This thesis presents novel similarity search and skyline retrieval approaches that are designed to handle high-dimensional data and to achieve enhanced results. This chapter summarizes the major theoretical and practical contributions of this work.

7.1 Preliminaries (Part I)

The preliminaries in Part I provide some motivation and illustrate the topic and the background of this work. We considered advanced database systems and analyzed characteristics of data objects and tasks that are typical for advanced database systems. In Sections 1.2 and 1.3 we introduced basic concepts of similarity search and multi-criteria optimization. Based on the typical characteristics of data objects and tasks in advanced database systems, new challenges for similarity search and multi-criteria optimization techniques were elaborated in Section 1.4. After a short outline of this thesis in Section 1.5, in Chapter 2 we surveyed important previous work in similarity and skyline query processing related to the techniques proposed in this thesis.


7.2 Similarity Query Optimization (Part II)

In Chapter 3 we introduced a new approach to indexing multidimensional data that is particularly suitable for the efficient incremental processing of nearest neighbor queries. The basic idea is to split the data space vertically into multiple low- and medium-dimensional data spaces. The data from each of these lower-dimensional subspaces is organized by using a standard multidimensional index structure. In order to perform incremental NN queries on top of index-striping efficiently, we first developed an algorithm for merging the results received from the underlying indexes. Then, an accurate cost model relying on a power law was presented that determines an appropriate number of indexes. Moreover, we considered the problem of dimension assignment, where each dimension is assigned to a lower-dimensional subspace, such that the cost of nearest neighbor queries is minimized.

In Chapter 4, we presented a generalization of the iDistance technique, called Multidimensional iDistance (MiD), for k-nearest-neighbor query processing. Three main steps are performed for building MiD. In agreement with iDistance, firstly, data points are partitioned into clusters and, secondly, a reference point is determined for every cluster. However, the third step substantially differs from iDistance, as a data object is mapped to an m-dimensional distance vector where m > 1 generally holds. The m dimensions are generated by splitting the original data space into m subspaces and computing the partial distance between the object and the reference point for every subspace. The resulting m-dimensional points can be indexed by an arbitrary point access method like an R-tree. Our crucial parameter m is derived from a cost model that is based on a power law. We presented range and k-NN query processing algorithms for MiD.

7.3 Skyline Query Optimization (Part III)

In Chapter 5 we first introduced the problem of Constrained Subspace Skyline Queries (CSSQ). We presented a query processing algorithm which builds on multiple low-dimensional index structures. Due to the usage of well-performing low-dimensional indexes, constrained subspace skyline queries for arbitrarily large subspaces are efficiently supported. In order to support constrained skylines for arbitrary subspaces, we presented approaches exploiting multiple low-dimensional indexes instead of relying on a single high-dimensional index. Effective pruning strategies are applied to discard points from dominated regions. An important ingredient of our approach is the workload-adaptive strategy for determining the number of indexes and the assignment of dimensions to the indexes.

Chapter 6 introduced the concept of Reverse Skyline Queries (RSQ). Given a set of data points P and a query point q, an RSQ returns the data objects that have the query object in the set of their dynamic skyline. Such a dynamic skyline corresponds to the skyline of a transformed data space where point q becomes the origin and all points are represented by their distance to q. In order to compute the reverse skyline of an arbitrary query point, we first proposed a branch-and-bound algorithm (called BBRS), which is an improved customization of the original BBS algorithm. To reduce the computational cost of determining whether a point belongs to the reverse skyline even further, we proposed an enhanced algorithm (called RSSA) that is based on accurate pre-computed approximations of the skylines. These approximations are used to identify whether a point belongs to the reverse skyline or not.


8

Discussion on Future Work

At the end of this thesis, let us consider possible further directions for research which have been motivated by the novel techniques for similarity search and multi-criteria optimization. First, we discuss promising enhancements of the methods proposed in this work. In addition, we sketch our vision of the future of similarity search and multi-criteria optimization techniques for advanced database systems.

• Experimental study against other methods can be conducted. Most meth-ods proposed for high-dimensional range queries are usually not suitablefor k-NN search, while most indexes designed for k-NN or similarity searchare not able to support window queries. Hence, comparison studies arelikely to be orthogonal. A benchmark database and query set would makethe comparison more meaningful and provide a reference point for otherrelated work.

• The proposed indexing approaches could be integrated into the database kernel, as was done for the UB-tree, in order to achieve higher performance gains from the use of these indexing methods. It is important to measure the complexity and cost of integrating new indexing methods into the kernel, as well as the actual gain from the integration.

• Similarity or k-nearest neighbor (k-NN) queries are common in knowledge discovery. Similarity joins are computationally expensive, and hence efficient similarity join strategies based on the similarity indexes are important for efficient query processing. Further study on the use of the proposed nearest neighbor algorithms in similarity joins and reverse k-NN search is required.

• Dimensionality reduction can be used to mitigate the dimensionality curse by reducing the number of dimensions of the high-dimensional data before indexing on the reduced dimensions. After dimensionality reduction, each cluster of data lies in a different axis system and needs to be indexed for k-NN queries. Instead of creating one index, we aim to study Index Striping after dimensionality reduction. Additionally, for the iDistance approach, instead of creating one index for each cluster, our Index Striping framework is a good candidate for indexing the data projections from the different reduced-dimensionality spaces.

• Another direction, which we consider as future work, concerns the scheduler modifications required to examine how the memory allocation of the external priority queues influences our nearest neighbor algorithms.

• Similar to k-nearest neighbor queries, another direction for further work is the study of reverse k-skyband queries. In addition, the examination of reverse skyline query processing using constraints on the space is a very interesting direction.

• Moreover, efficient algorithms for computing accurate skyline approximations for three- and higher-dimensional data are another promising direction for future work.

In summary, the four techniques presented in this thesis provide a good basis for further study in the context of high-dimensional database applications.


List of Figures

1.1 Similarity Search in Image Databases
1.2 Multi-step Query Processing
1.3 Example dataset and Skyline

2.1 Multi-step Range Query Processing [109]
2.2 Multi-step k-NN Query Processing [109]
2.3 Multi-step Incremental NN Query Processing [109]
2.4 An R-tree example
2.5 Discovery of points i and a [89]

3.1 Overview of our Architecture
3.2 Total Cost of k-NN query
3.3 Sequence Heaps
3.4 Optimizing Sequence Heaps
3.5 "Self-Adaptive"
3.6 The Naive Approach
3.7 The New Approach
3.8 Accuracy of our Cost Model
3.9 Average number of candidates for 10-NN queries (5 attributes selected from 16)
3.10 Average number of I/Os for 10-NN queries for different attribute assignments
3.11 Average number of candidates of k-NN queries as a function of k (different assignment strategies)
3.12 Average number of candidates and I/Os as a function of k
3.13 Average number of candidates for k-NN queries as a function of k (different scheduling strategies)
3.14 Comparative Study
3.15 Average number of candidates and I/Os as a function of k
3.16 External Priority Queues

4.1 The principles of iDistance
4.2 Distance errors
4.3 Example of Data Transformation
4.4 The MiD Structure
4.5 Total Cost of k-NN query
4.6 Effect of the Dimensionality
4.7 Average number of CPU cost
4.8 Average number of I/O cost
4.9 Average number of I/Os and candidates as a function of k (Uniform dataset)
4.10 Average number of I/Os and CPU cost as a function of k (Real dataset)
4.11 Effect of m on k-NN search

5.1 Constrained skyline in 2-d subspace
5.2 Dominance region of p
5.3 Pruned area as function of dimensions
5.4 Simultaneous pruning
5.5 Pruning strategy
5.6 Example of Pruning Strategy
5.7 Improved pruning heuristic
5.8 Pruning Strategies
5.9 Random Grouping Effect
5.10 Uniform dataset
5.11 Uniform and Color Datasets
5.12 Uniform dataset
5.13 Uniform and Real Datasets
5.14 Uniform Dataset
5.15 Scalability

6.1 Example of Reverse Skyline
6.2 Illustration of the BBS Algorithm
6.3 A Database Example
6.4 Dynamic and Reverse Skyline
6.5 Global Skyline Example
6.6 Illustration of DDR(p) and DADR(p)
6.7 Approximated Skyline of p
6.8 Refinement in the Original Space
6.9 The Multi-Step Algorithm
6.10 Performance for different values of k
6.11 Reverse vs. Global Skyline Size
6.12 RSSA performance for different values of k
6.13 Average cost for our algorithms
6.14 Scalability with dimensionality
6.15 Scalability with cardinality


List of Tables

1.1 Used car market: offers
1.2 Notebook Configuration

3.1 Overview of symbols


References

1. Achtert, E., Böhm, C., Kröger, P., Kunath, P., Pryakhin, A., and Renz, M. Efficient reverse k-nearest neighbor search in arbitrary metric spaces. In ACM SIGMOD International Conference on Management of Data (2006), pp. 515–526.

2. Aggarwal, C. C., Procopiuc, C. M., Wolf, J. L., Yu, P. S., and Park, J. S. Fast algorithms for projected clustering. In ACM SIGMOD International Conference on Management of Data (1999), pp. 61–72.

3. Ankerst, M., Kastenmüller, G., Kriegel, H.-P., and Seidl, T. 3D shape histograms for similarity search and classification in spatial databases. In International Symposium on Advances in Spatial Databases (SSTD) (1999), pp. 207–226.

4. Arya, S., Mount, D. M., and Narayan, O. Accounting for boundary effects in nearest neighbor searching. In Symposium on Computational Geometry (1995), pp. 336–344.

5. Balke, W.-T., and Güntzer, U. Multi-objective query processing for database systems. In International Conference on Very Large Data Bases (VLDB) (2004), pp. 936–947.

6. Balke, W.-T., Güntzer, U., and Siberski, W. Exploiting indifference for customization of partial order skylines. In International Database Engineering and Applications Symposium (IDEAS) (2006), pp. 80–88.

7. Balke, W.-T., Güntzer, U., and Zheng, J. X. Efficient distributed skylining for web information systems. In International Conference on Extending Database Technology (EDBT) (2004), pp. 256–273.

8. Barndorff-Nielsen, O., and Sobel, M. On the distribution of the number of admissible points in a vector random sample. Theory of Probability and Its Applications 11, 2 (1966), 249–269.

9. Bartolini, I., Ciaccia, P., Oria, V., and Özsu, M. T. Integrating the results of multimedia sub-queries using qualitative preferences. In Multimedia Information Systems (MIS) (2004), pp. 66–75.

10. Bartolini, I., Ciaccia, P., Oria, V., and Özsu, M. T. Flexible integration of multimedia sub-queries with qualitative preferences. Multimedia Tools and Applications 33, 3 (2006), 275–300.

11. Bartolini, I., Ciaccia, P., and Patella, M. SaLSa: Computing the skyline without scanning the whole sky. In ACM International Conference on Information and Knowledge Management (CIKM) (2006), pp. 405–414.

12. Bayer, R., and McCreight, E. M. Organization and maintenance of large ordered indices. Acta Informatica 1 (1972), 173–189.

13. Beckmann, N., Kriegel, H.-P., Schneider, R., and Seeger, B. The R*-tree: An efficient and robust access method for points and rectangles. In ACM SIGMOD International Conference on Management of Data (1990), pp. 322–331.

14. Belussi, A., and Faloutsos, C. Self-spacial join selectivity estimation using fractal concepts. ACM Transactions on Information Systems 16, 2 (1998), 161–201.

15. Bentley, J. L., Clarkson, K. L., and Levine, D. B. Fast linear expected-time algorithms for computing maxima and convex hulls. In ACM-SIAM Symposium on Discrete Algorithms (SODA) (1990), pp. 179–187.

16. Bentley, J. L., Kung, H. T., Schkolnick, M., and Thompson, C. D. On the average number of maxima in a set of vectors and applications. Journal of the ACM (JACM) 25, 4 (1978), 536–543.

17. Berchtold, S., Böhm, C., Jagadish, H. V., Kriegel, H.-P., and Sander, J. Independent quantization: An index compression technique for high-dimensional data spaces. In International Conference on Data Engineering (ICDE) (2000), pp. 577–588.

18. Berchtold, S., Böhm, C., Keim, D. A., and Kriegel, H.-P. A cost model for nearest neighbor search in high-dimensional data space. In ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS) (1997), pp. 78–86.

19. Berchtold, S., Böhm, C., Keim, D. A., Kriegel, H.-P., and Xu, X. Optimal multidimensional query processing using tree striping. In International Conference on Data Warehousing and Knowledge Discovery (DAWAK) (2000), pp. 244–257.

20. Berchtold, S., Böhm, C., and Kriegel, H.-P. The pyramid-technique: Towards breaking the curse of dimensionality. In ACM SIGMOD International Conference on Management of Data (1998), pp. 142–153.

21. Berchtold, S., Keim, D. A., and Kriegel, H.-P. The X-tree: An index structure for high-dimensional data. In International Conference on Very Large Data Bases (VLDB) (1996), pp. 28–39.

22. Beyer, K. S., Goldstein, J., Ramakrishnan, R., and Shaft, U. When is "nearest neighbor" meaningful? In International Conference on Database Theory (ICDT) (1999), pp. 217–235.

23. Bingham, E., and Mannila, H. Random projection in dimensionality reduction: Applications to image and text data. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2001), pp. 245–250.

24. Böhm, C. A cost model for query processing in high dimensional data spaces. ACM Transactions on Database Systems 25, 2 (2000), 129–178.

25. Böhm, C., Berchtold, S., and Keim, D. A. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Computing Surveys 33, 3 (2001), 322–373.

26. Börzsönyi, S., Kossmann, D., and Stocker, K. The skyline operator. In IEEE International Conference on Data Engineering (ICDE) (2001), pp. 421–430.

27. Bustos, B., Keim, D. A., Saupe, D., Schreck, T., and Vranić, D. V. Feature-based similarity search in 3D object databases. ACM Computing Surveys 37, 4 (2005), 345–387.

28. Chakrabarti, K., and Mehrotra, S. Local dimensionality reduction: A new approach to indexing high dimensional spaces. In International Conference on Very Large Data Bases (VLDB) (2000), pp. 89–100.

29. Chan, C. Y., Eng, P.-K., and Tan, K.-L. Efficient processing of skyline queries with partially-ordered domains. In IEEE International Conference on Data Engineering (ICDE) (2005), pp. 190–191.

30. Chan, C. Y., Eng, P.-K., and Tan, K.-L. Stratified computation of skylines with partially-ordered domains. In ACM SIGMOD International Conference on Management of Data (2005), pp. 203–214.

31. Chan, C. Y., Jagadish, H. V., Tan, K.-L., Tung, A. K. H., and Zhang, Z. On high dimensional skylines. In International Conference on Extending Database Technology (EDBT) (2006), pp. 478–495.

32. Chaudhuri, S., Dalvi, N. N., and Kaushik, R. Robust cardinality and cost estimation for skyline operator. In IEEE International Conference on Data Engineering (ICDE) (2006), p. 64.

33. Chomicki, J. Preference formulas in relational queries. ACM Transactions on Database Systems (TODS) 28, 4 (2003), 427–466.

34. Chomicki, J., Godfrey, P., Gryz, J., and Liang, D. Skyline with presorting. In IEEE International Conference on Data Engineering (ICDE) (2003), pp. 717–816.

35. Chomicki, J., Godfrey, P., Gryz, J., and Liang, D. Skyline with presorting: Theory and optimizations. In Intelligent Information Systems (IIS) (2005), pp. 595–604.

36. Ciaccia, P., Patella, M., and Zezula, P. M-tree: An efficient access method for similarity search in metric spaces. In International Conference on Very Large Data Bases (VLDB) (1997), pp. 426–435.

37. Ciaccia, P., Patella, M., and Zezula, P. A cost model for similarity queries in metric spaces. In ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS) (1998), pp. 59–68.

38. Cleary, J. G. Analysis of an algorithm for finding nearest neighbors in Euclidean space. ACM Transactions on Mathematical Software 5, 2 (1979), 183–192.

39. Dellis, E., and Seeger, B. Efficient computation of reverse skyline queries. In International Conference on Very Large Data Bases (VLDB) (2007).

40. Dellis, E., Seeger, B., and Vlachou, A. Nearest neighbor search on vertically partitioned high-dimensional data. In International Conference on Data Warehousing and Knowledge Discovery (DAWAK) (2005), pp. 243–253.

41. Dellis, E., Vlachou, A., Vladimirskiy, I., Seeger, B., and Theodoridis, Y. Constrained subspace skyline computation. In ACM International Conference on Information and Knowledge Management (CIKM) (2006), pp. 415–424.

42. den Bercken, J. V., Blohsfeld, B., Krämer, J., Schäfer, T., and Seeger, B. XXL - a library approach to supporting efficient implementations of advanced database queries. In International Conference on Very Large Data Bases (VLDB) (2001), pp. 39–48.

43. Deng, K., Zhou, X., and Shen, H. T. Multi-source skyline query processing in road networks. In IEEE International Conference on Data Engineering (ICDE) (2007), pp. 796–805.

44. Eng, P.-K., Ooi, B. C., and Tan, K.-L. Indexing for progressive skyline computation. Data and Knowledge Engineering 46, 2 (2003), 196–201.

45. Fagin, R., Lotem, A., and Naor, M. Optimal aggregation algorithms for middleware. In ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS) (2001).

46. Faloutsos, C. Gray codes for partial match and range queries. IEEE Transactions on Software Engineering 14, 10 (1988), 1381–1393.

47. Faloutsos, C., and Lin, K.-I. FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In ACM SIGMOD International Conference on Management of Data (1995), pp. 163–174.

48. Faloutsos, C., Ranganathan, M., and Manolopoulos, Y. Fast subsequence matching in time-series databases. In ACM SIGMOD International Conference on Management of Data (1994), pp. 419–429.

49. Faloutsos, C., Seeger, B., Traina, A. J. M., and Jr., C. T. Spatial join selectivity using power laws. In ACM SIGMOD International Conference on Management of Data (2000), pp. 177–188.

50. Ferhatosmanoglu, H., Stanoi, I., Agrawal, D., and Abbadi, A. E. Constrained nearest neighbor queries. In International Symposium on Advances in Spatial Databases (SSTD) (2001), pp. 257–278.

51. Friedman, J. H., Bentley, J. L., and Finkel, R. A. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software 3, 3 (1977), 209–226.

52. Gaede, V., and Günther, O. Multidimensional access methods. ACM Computing Surveys 30, 2 (1998), 170–231.

53. Gao, Y., Chen, G., Chen, L., and Chen, C. Parallelizing progressive computation for skyline queries in multi-disk environment. In International Conference on Database and Expert Systems Applications (DEXA) (2006), pp. 697–706.

54. Gionis, A., Indyk, P., and Motwani, R. Similarity search in high dimensions via hashing. In International Conference on Very Large Data Bases (VLDB) (1999), pp. 518–529.

55. Godfrey, P. Skyline cardinality for relational processing. In Foundations of Information and Knowledge Systems (FoIKS) (2004), pp. 78–97.

56. Godfrey, P., Shipley, R., and Gryz, J. Maximal vector computation in large data sets. In International Conference on Very Large Data Bases (VLDB) (2005), pp. 229–240.

57. Goh, C. H., Lim, A., Ooi, B. C., and Tan, K.-L. Efficient indexing of high-dimensional data through dimensionality reduction. Data and Knowledge Engineering 32, 2 (2000), 115–130.

58. Gray, J., Bosworth, A., Layman, A., and Pirahesh, H. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-total. In IEEE International Conference on Data Engineering (ICDE) (1996), pp. 152–159.

59. Gray, J., Chaudhuri, S., Bosworth, A., Layman, A., Reichart, D., Venkatrao, M., Pellow, F., and Pirahesh, H. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-total. Data Mining and Knowledge Discovery 1, 1 (1997), 29–53.

60. Gudivada, V. N. Spatial similarity measures for multimedia applications. In Storage and Retrieval for Image and Video Databases (SPIE) (1995), pp. 363–372.

61. Güntzer, U., Balke, W.-T., and Kießling, W. Optimizing multi-feature queries for image databases. In International Conference on Very Large Data Bases (VLDB) (2000), pp. 419–428.

62. Guttman, A. R-trees: A dynamic index structure for spatial searching. In ACM SIGMOD International Conference on Management of Data (1984), pp. 47–57.

63. Haas, P. J., Naughton, J. F., Seshadri, S., and Stokes, L. Sampling-based estimation of the number of distinct values of an attribute. In International Conference on Very Large Data Bases (VLDB) (1995), pp. 311–322.

64. Hadjieleftheriou, M., Kollios, G., Tsotras, V. J., and Gunopulos, D. Indexing spatiotemporal archives. VLDB Journal 15, 2 (2006), 143–164.

65. Henrich, A. A distance scan algorithm for spatial access structures. In ACM Workshop on Advances in Geographic Information Systems (GIS) (1994), pp. 136–143.

66. Hinneburg, A., Aggarwal, C. C., and Keim, D. A. What is the nearest neighbor in high dimensional spaces? In International Conference on Very Large Data Bases (VLDB) (2000), pp. 506–515.

67. Hjaltason, G. R., and Samet, H. Ranking in spatial databases. In International Symposium on Advances in Spatial Databases (SSTD) (1995), pp. 83–95.

68. Hjaltason, G. R., and Samet, H. Distance browsing in spatial databases. ACM Transactions on Database Systems 24, 2 (1999), 265–318.

69. Hose, K. Processing skyline queries in P2P systems. In Proceedings of VLDB'05 PhD Workshop (2005), pp. 36–40.

70. Hose, K., Karnstedt, M., Koch, A., Sattler, K.-U., and Zinn, D. Processing rank-aware queries in P2P systems. In International Workshop on Databases, Information Systems and Peer-to-Peer Computing (DBISP2P) (2005), pp. 238–249.

71. Hose, K., Lemke, C., and Sattler, K.-U. Processing relaxed skylines in PDMS using distributed data summaries. In ACM International Conference on Information and Knowledge Management (CIKM) (2006), pp. 425–434.

72. Huang, X., and Jensen, C. S. In-route skyline querying for location-based services. In Web and Wireless Geographical Information Systems (W2GIS) (2004), pp. 120–135.

73. Huang, Z., Jensen, C. S., Lu, H., and Ooi, B. C. Skyline queries against mobile lightweight devices in MANETs. In IEEE International Conference on Data Engineering (ICDE) (2006), p. 66.

74. Huang, Z., Lu, H., Ooi, B. C., and Tung, A. K. H. Continuous skyline queries for moving objects. IEEE Transactions on Knowledge and Data Engineering (TKDE) 18, 12 (2006), 1645–1658.

75. Jagadish, H. V. Linear clustering of objects with multiple attributes. In ACM SIGMOD International Conference on Management of Data (1990), pp. 332–342.

76. Jagadish, H. V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K. C., and Suel, T. Optimal histograms with quality guarantees. In International Conference on Very Large Data Bases (VLDB) (1998), pp. 275–286.

77. Jagadish, H. V., Ooi, B. C., Tan, K.-L., Yu, C., and Zhang, R. iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Transactions on Database Systems (TODS) 30, 2 (2005), 364–397.

78. Jin, W., Han, J., and Ester, M. Mining thick skylines over large databases. In International Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD) (2004), pp. 255–266.

79. Jolliffe, I. Principal component analysis. Springer Verlag (1986).

80. Jr., C. T., Traina, A. J. M., Seeger, B., and Faloutsos, C. Slim-trees: High performance metric trees minimizing overlap between nodes. In International Conference on Extending Database Technology (EDBT) (2000), pp. 51–65.

81. Jr., C. T., Traina, A. J. M., Wu, L., and Faloutsos, C. Fast feature selection using fractal dimension. In Brazilian Symposium on Databases (SBBD) (2000), pp. 158–171.

82. Kanth, K. V. R., Ravada, S., Sharma, J., and Banerjee, J. Indexing medium-dimensionality data in Oracle. In ACM SIGMOD International Conference on Management of Data (1999), pp. 521–522.

83. Katayama, N., and Satoh, S. The SR-tree: An index structure for high-dimensional nearest neighbor queries. In ACM SIGMOD International Conference on Management of Data (1997), pp. 369–380.

84. Kießling, W. Foundations of preferences in database systems. In International Conference on Very Large Data Bases (VLDB) (2002), pp. 311–322.

85. Koltun, V., and Papadimitriou, C. H. Approximately dominating representatives. In International Conference on Database Theory (ICDT) (2005), pp. 204–214.

86. Korn, F., and Muthukrishnan, S. Influence sets based on reverse nearest neighbor queries. In ACM SIGMOD International Conference on Management of Data (2000), pp. 201–212.

87. Korn, F., Pagel, B.-U., and Faloutsos, C. On the 'dimensionality curse' and the 'self-similarity blessing'. IEEE Transactions on Knowledge and Data Engineering (TKDE) 13, 1 (2001), 96–111.

88. Korn, F., Sidiropoulos, N., Faloutsos, C., Siegel, E., and Protopapas, Z. Fast nearest neighbor search in medical image databases. In International Conference on Very Large Data Bases (VLDB) (1996), pp. 215–226.

89. Kossmann, D., Ramsak, F., and Rost, S. Shooting stars in the sky: An online algorithm for skyline queries. In International Conference on Very Large Data Bases (VLDB) (2002), pp. 275–286.

90. Kruskal, J. B. On the shortest spanning subtree of a graph and the travelling salesman problem. Proceedings of the American Mathematical Society 7 (1956), 48–50.

91. Kung, H. T., Luccio, F., and Preparata, F. P. On finding the maxima of a set of vectors. Journal of the ACM (JACM) 22, 4 (1975), 469–476.

92. Li, C., Ooi, B. C., Tung, A. K. H., and Wang, S. DADA: A data cube for dominant relationship analysis. In ACM SIGMOD International Conference on Management of Data (2006), pp. 659–670.

93. Lin, K.-I., Jagadish, H. V., and Faloutsos, C. The TV-tree: An index structure for high-dimensional data. VLDB Journal 3, 4 (1994), 517–542.

94. Lin, X., Yuan, Y., Wang, W., and Lu, H. Stabbing the sky: Efficient skyline computation over sliding windows. In IEEE International Conference on Data Engineering (ICDE) (2005), pp. 502–513.

95. Lin, X., Yuan, Y., Zhang, Q., and Zhang, Y. Selecting stars: The k most representative skyline operator. In IEEE International Conference on Data Engineering (ICDE) (2007), pp. 86–95.

96. Mouratidis, K., Bakiras, S., and Papadias, D. Continuous monitoring of top-k queries over sliding windows. In ACM SIGMOD International Conference on Management of Data (2006), pp. 635–646.

97. Muthukrishnan, S., and Suel, T. Approximation algorithms for array partitioning problems. Journal of Algorithms 54, 1 (2005), 85–104.

98. Ooi, B. C., Tan, K.-L., Yu, C., and Bressan, S. Indexing the edges - a simple and yet efficient approach to high-dimensional indexing. In ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS) (2000), pp. 166–174.

99. Orenstein, J. A. Spatial query processing in an object-oriented database system. In ACM SIGMOD International Conference on Management of Data (1986), pp. 326–336.

100. Pagel, B.-U., Korn, F., and Faloutsos, C. Deflating the dimensionality curse using multiple fractal dimensions. In IEEE International Conference on Data Engineering (ICDE) (2000), pp. 589–598.

101. Papadias, D., Tao, Y., Fu, G., and Seeger, B. An optimal and progressive algorithm for skyline queries. In ACM SIGMOD International Conference on Management of Data (2003), pp. 467–478.

102. Papadias, D., Tao, Y., Fu, G., and Seeger, B. Progressive skyline computation in database systems. ACM Transactions on Database Systems (TODS) 30, 1 (2005), 41–82.

103. Papadopoulos, A., and Manolopoulos, Y. Performance of nearest neighbor queries in R-trees. In International Conference on Database Theory (ICDT) (1997), pp. 394–408.

104. Pei, J., Fu, A. W.-C., Lin, X., and Wang, H. Computing compressed multidimensional skyline cubes efficiently. In IEEE International Conference on Data Engineering (ICDE) (2007), pp. 96–105.

105. Pei, J., Jin, W., Ester, M., and Tao, Y. Catching the best views of skyline: A semantic approach based on decisive subspaces. In International Conference on Very Large Data Bases (VLDB) (2005), pp. 253–264.

106. Preparata, F. P., and Shamos, M. I. Computational geometry: An introduction. Springer Verlag (1985).

107. Roussopoulos, N., Kelley, S., and Vincent, F. Nearest neighbor queries. In ACM SIGMOD International Conference on Management of Data (1995), pp. 71–79.

108. Sakurai, Y., Yoshikawa, M., Uemura, S., and Kojima, H. The A-tree: An index structure for high-dimensional spaces using relative approximation. In International Conference on Very Large Data Bases (VLDB) (2000), pp. 516–526.

109. Seidl, T., and Kriegel, H.-P. Optimal multi-step k-nearest neighbor search. In ACM SIGMOD International Conference on Management of Data (1998), pp. 154–165.

110. Sellis, T. K., Roussopoulos, N., and Faloutsos, C. The R+-tree: A dynamic index for multi-dimensional objects. In International Conference on Very Large Data Bases (VLDB) (1987), pp. 507–518.

111. Shaft, U., and Ramakrishnan, R. Theory of nearest neighbors indexability. ACM Transactions on Database Systems (TODS) 31, 3 (2006), 814–838.

112. Sharifzadeh, M., and Shahabi, C. The spatial skyline queries. In International Conference on Very Large Data Bases (VLDB) (2006), pp. 751–762.

113. Singh, A., Ferhatosmanoglu, H., and Tosun, A. S. High dimensional reverse nearest neighbor queries. In ACM International Conference on Information and Knowledge Management (CIKM) (2003), pp. 91–98.

114. Sproull, R. F. Refinements to nearest-neighbor searching in k-dimensional trees. Algorithmica 6, 4 (1991), 579–589.

115. Stanoi, I., Agrawal, D., and Abbadi, A. E. Reverse nearest neighbor queries for dynamic databases. In ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (2000), pp. 44–53.

116. Tan, K.-L., Eng, P.-K., and Ooi, B. C. Efficient progressive skyline computation. In International Conference on Very Large Data Bases (VLDB) (2001), pp. 301–310.

117. Tao, Y., and Papadias, D. Maintaining sliding window skylines on data streams. IEEE Transactions on Knowledge and Data Engineering (TKDE) 18, 2 (2006), 377–391.

118. Tao, Y., Papadias, D., and Lian, X. Reverse kNN search in arbitrary dimensionality. In International Conference on Very Large Data Bases (VLDB) (2004), pp. 744–755.

119. Tao, Y., Xiao, X., and Pei, J. SUBSKY: Efficient computation of skylines in subspaces. In IEEE International Conference on Data Engineering (ICDE) (2006), p. 65.

120. Tao, Y., Zhang, J., Papadias, D., and Mamoulis, N. An efficient cost model for optimization of nearest neighbor search in low and medium dimensional spaces. IEEE Transactions on Knowledge and Data Engineering 16, 10 (2004), 1169–1184.

121. Vlachou, A., Doulkeridis, C., Vazirgiannis, M., and Kotidis, Y. SKYPEER: Efficient subspace skyline computation over distributed data. In IEEE International Conference on Data Engineering (ICDE) (2007), pp. 416–425.

122. Wang, S., Ooi, B. C., Tung, A. K. H., and Xu, L. Efficient skyline query processing on peer-to-peer networks. In IEEE International Conference on Data Engineering (ICDE) (2007), pp. 1126–1135.

123. Weber, R., Schek, H.-J., and Blott, S. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In International Conference on Very Large Data Bases (VLDB) (1998), pp. 194–205.

124. White, D. A., and Jain, R. Similarity indexing: Algorithms and performance. In Storage and Retrieval for Image and Video Databases (SPIE) (1996), pp. 62–73.

125. White, D. A., and Jain, R. Similarity indexing with the SS-tree. In IEEE International Conference on Data Engineering (ICDE) (1996), pp. 516–523.

126. Wu, P., Agrawal, D., Egecioglu, O., and Abbadi, A. E. DeltaSky: Optimal maintenance of skyline deletions without exclusive dominance region generation. In IEEE International Conference on Data Engineering (ICDE) (2007), pp. 486–495.

127. Wu, P., Zhang, C., Feng, Y., Zhao, B. Y., Agrawal, D., and Abbadi, A. E. Parallelizing skyline queries for scalable distribution. In International Conference on Extending Database Technology (EDBT) (2006), pp. 112–130.

128. Xia, T., and Zhang, D. Refreshing the sky: The compressed skycube with efficient support for frequent updates. In ACM SIGMOD International Conference on Management of Data (2006), pp. 491–502.

129. Yang, C., and Lin, K.-I. An index structure for efficient reverse nearest neighbor queries. In IEEE International Conference on Data Engineering (ICDE) (2001), pp. 485–492.

130. Yu, C., Ooi, B. C., Tan, K.-L., and Jagadish, H. V. Indexing the distance: An efficient method to kNN processing. In International Conference on Very Large Data Bases (VLDB) (2001), pp. 421–430.

131. Yuan, Y., Lin, X., Liu, Q., Wang, W., Yu, J. X., and Zhang, Q. Efficient computation of the skyline cube. In International Conference on Very Large Data Bases (VLDB) (2005), pp. 241–252.

132. Zhang, R., Ooi, B. C., and Tan, K.-L. Making the pyramid technique robust to query types and workloads. In IEEE International Conference on Data Engineering (ICDE) (2004), pp. 313–324.

Curriculum Vitae

Evangelos Dellis was born on September 10, 1977 in Trikala, Greece. He attended primary school from 1981 to 1985, and high school from 1985 to 1995.

He entered the University of Athens, Greece, in October 1995, studying Computer Science. His diploma thesis was on 'Applying Multi-resolution Techniques on Quaternion Fractals', supervised by Prof. Dr. T. Theoharis.

In December 2003, Evangelos Dellis started working at the University of Marburg as a research assistant in the group of Prof. Dr. Bernhard Seeger, the chair of the teaching and research unit for Database Systems at the Department of Mathematics and Computer Science. His research interests include high-dimensional indexing, similarity search in large spatial and multimedia databases, skyline query processing, and ontology-driven spatial query processing.