14
An efficient skyline framework for matchmaking applications $ Hyuck Han a , Hyungsoo Jung b, , Hyeonsang Eom a , H.Y. Yeom a a School of Computer Science and Engineering, Seoul National University, Seoul 151-742, South Korea b School of Information Technologies, University of Sydney, NSW 2006, Australia article info Article history: Received 31 March 2010 Received in revised form 19 August 2010 Accepted 25 August 2010 Keywords: Skyline computation R-tree Lattice Breadth-first search abstract In this article, we present a skyline-based matchmaking framework. The current method of carrying out the matchmaking procedure identifies items based on users’ specifications. We rethink matchmaking procedures in such a way that they can find items that can satisfy a specific computing demand from a user and recommend a collection of better candidates among the identified items. This endows a user with the right of choice on deciding the best-possible items. We approach the recommendation from the perspective of skyline computation and present an efficient skyline algorithm that gathers interesting item candidates efficiently. To devise an efficient sequential skyline algorithm, we adopt (i) lattice-based indexing using a lattice composition technique and (ii) an optimized dominance-check algorithm. Moreover, we parallelize the algorithm using breadth- first-search (BFS). Our extensive experimental results show that our algorithm outperforms current state-of-the-art algorithms, and the speedup factor of the parallelized algorithm is near-linear. & 2010 Elsevier Ltd. All rights reserved. 1. Introduction In the era of web services, a user typically specifies require- ments for interesting items and submits the specification through a matchmaking web service. Then, the matchmaking web service finds appropriate and available candidate items, and returns them to the user. In this context, we have an interesting question about ‘‘appropriate’’ items: which items are the best among thousands or millions of candidates to a user? We believe that candidates that have at least a single comparative advantage over other candidates are more appropriate. However, it is a non-trivial job to find the most appropriate candidates among millions of items. Example 1 (Item recommendation & skyline computation.). Table 1 shows a data set from a hotel reservation service. Each hotel has amenities such as a workout center and a parking lot represented as boolean attributes. The hotel also has price, star rating, and distance from the city center. From the table, users want candidates of hotels that satisfy users’ requirements ($80 rPrice r$110 and 0.5 mile r Distance from City Hall r 1.2 mile). Hawaii Inn is excluded in results since it does not meet users’ requirements. Although both Honolulu Inn and Kalawao Motel satisfy users’ needs, the user prefers Honolulu Inn to Kalawao Motel with the assumption that users prefer hotels with more stars and amenities, lower price, and shorter dis- tance. On the other hand, Kauai Sleep and Maui Motel can be recommended since both hotels are not worse than (not dominated by) other items. According to the example, recommending better items involves the same work as finding the maximals in a set of vectors (Godfrey et al., 2005), and the maximal vector problem is known as the skyline computation in the database society. The skyline operator that orzs ¨ onyi et al. (2001) introduced is used to find all better tuples instead of vectors that satisfy users’ needs. Thus, the skyline operator is a novel solution for the item recommendation for matchmaking procedures from a theoretical viewpoint. Definition 1 (Skyline operator orzs¨ onyi et al. (2001)). Given a set of data points, the skyline operator returns a set of points that are not dominated by other points. A point p i is said to dominate another point p j if p i is better than p j in at least one dimension and not worse than p j in other dimensions. The hotel skyline in Example 1 is composed of Honolulu Inn, Maui Motel and Kauai Sleep. The skyline for the hotel reservation is computed over a data set with multiple low- cardinality attributes (workout center, parking lot, and star rating) and multiple unrestricted attributes (price and distance). In this paper, we focus on the skyline computation over multiple low- cardinality and unrestricted attributes that are common in real-world applications. Morse et al. (2007) addressed the Contents lists available at ScienceDirect journal homepage: www.elsevier.com/locate/jnca Journal of Network and Computer Applications 1084-8045/$ - see front matter & 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.jnca.2010.08.011 $ A preliminary version Han et al. (2009) of this paper was presented at IEEE/ACM CCGrid 2009, Shanghai, China. Corresponding author. E-mail addresses: [email protected] (H. Han), [email protected] (H. Jung), [email protected] (H. Eom), [email protected] (H.Y. Yeom). Journal of Network and Computer Applications 34 (2011) 102–115

An efficient skyline framework for matchmaking applications

Embed Size (px)

Citation preview

Page 1: An efficient skyline framework for matchmaking applications

Journal of Network and Computer Applications 34 (2011) 102–115

Contents lists available at ScienceDirect

Journal of Network and Computer Applications

1084-80

doi:10.1

$A p

IEEE/AC� Corr

E-m

hyungso

yeom@s

journal homepage: www.elsevier.com/locate/jnca

An efficient skyline framework for matchmaking applications$

Hyuck Han a, Hyungsoo Jung b,�, Hyeonsang Eom a, H.Y. Yeom a

a School of Computer Science and Engineering, Seoul National University, Seoul 151-742, South Koreab School of Information Technologies, University of Sydney, NSW 2006, Australia

a r t i c l e i n f o

Article history:

Received 31 March 2010

Received in revised form

19 August 2010

Accepted 25 August 2010

Keywords:

Skyline computation

R-tree

Lattice

Breadth-first search

45/$ - see front matter & 2010 Elsevier Ltd. A

016/j.jnca.2010.08.011

reliminary version Han et al. (2009) of th

M CCGrid 2009, Shanghai, China.

esponding author.

ail addresses: [email protected] (H. Ha

[email protected] (H. Jung), hseom@cse

nu.ac.kr (H.Y. Yeom).

a b s t r a c t

In this article, we present a skyline-based matchmaking framework. The current method of carrying out

the matchmaking procedure identifies items based on users’ specifications. We rethink matchmaking

procedures in such a way that they can find items that can satisfy a specific computing demand from a

user and recommend a collection of better candidates among the identified items. This endows a user

with the right of choice on deciding the best-possible items.

We approach the recommendation from the perspective of skyline computation and present an

efficient skyline algorithm that gathers interesting item candidates efficiently. To devise an efficient

sequential skyline algorithm, we adopt (i) lattice-based indexing using a lattice composition technique

and (ii) an optimized dominance-check algorithm. Moreover, we parallelize the algorithm using breadth-

first-search (BFS). Our extensive experimental results show that our algorithm outperforms current

state-of-the-art algorithms, and the speedup factor of the parallelized algorithm is near-linear.

& 2010 Elsevier Ltd. All rights reserved.

1. Introduction

In the era of web services, a user typically specifies require-ments for interesting items and submits the specification througha matchmaking web service. Then, the matchmaking web servicefinds appropriate and available candidate items, and returns themto the user. In this context, we have an interesting question about‘‘appropriate’’ items: which items are the best among thousands ormillions of candidates to a user? We believe that candidates thathave at least a single comparative advantage over othercandidates are more appropriate. However, it is a non-trivial jobto find the most appropriate candidates among millions of items.

Example 1 (Item recommendation & skyline computation.). Table 1shows a data set from a hotel reservation service. Each hotel hasamenities such as a workout center and a parking lot representedas boolean attributes. The hotel also has price, star rating, anddistance from the city center. From the table, users wantcandidates of hotels that satisfy users’ requirements($80rPricer$110 and 0.5 mile r Distance from City Hall r1.2 mile). Hawaii Inn is excluded in results since it does not meetusers’ requirements. Although both Honolulu Inn and Kalawao

Motel satisfy users’ needs, the user prefers Honolulu Inn to

ll rights reserved.

is paper was presented at

n),

.snu.ac.kr (H. Eom),

Kalawao Motel with the assumption that users prefer hotelswith more stars and amenities, lower price, and shorter dis-tance. On the other hand, Kauai Sleep and Maui Motel can berecommended since both hotels are not worse than (notdominated by) other items.

According to the example, recommending better itemsinvolves the same work as finding the maximals in a set ofvectors (Godfrey et al., 2005), and the maximal vector problem isknown as the skyline computation in the database society. Theskyline operator that Borzsonyi et al. (2001) introduced is used tofind all better tuples instead of vectors that satisfy users’ needs.Thus, the skyline operator is a novel solution for the itemrecommendation for matchmaking procedures from a theoreticalviewpoint.

Definition 1 (Skyline operator Borzsonyi et al. (2001)). Given a setof data points, the skyline operator returns a set of points that arenot dominated by other points. A point pi is said to dominateanother point pj if pi is better than pj in at least one dimension andnot worse than pj in other dimensions.

The hotel skyline in Example 1 is composed of Honolulu Inn,Maui Motel and Kauai Sleep. The skyline for the hotelreservation is computed over a data set with multiple low-

cardinality attributes (workout center, parking lot, and star rating)and multiple unrestricted attributes (price and distance). In thispaper, we focus on the skyline computation over multiple low-cardinality and unrestricted attributes that are common inreal-world applications. Morse et al. (2007) addressed the

Page 2: An efficient skyline framework for matchmaking applications

Table 1Example data set.

Hotel name Parking available Workout center available Star rating Price Distance from city center

Hawaii Inn F F % 60 1.2

Honolulu Inn F T % % 80 0.9

Maui Motel F F % % 110 0.6

Kauai Sleep T T % % % 101 1.0

Kalawao Motel F T % % 101 1.1

H. Han et al. / Journal of Network and Computer Applications 34 (2011) 102–115 103

low-cardinality characteristics of such applications, and proposedan efficient skyline algorithm called the Lattice Skyline (LS)

algorithm. However, the LS algorithm cannot process the skylinecomputation over data with more than two (multiple) unrest-ricted attributes.

In the earlier version of this paper (Han et al., 2009), weproposed an efficient sequential algorithm to process skylinecomputation over data with multiple low-cardinality and unrest-ricted attributes in real-world applications. The basic idea of thealgorithm is to (i) transform multiple low-cardinality domainsinto a single lattice structure, (ii) construct a set of R-trees(Guttman, 1984) according to the lattice structure, and (iii)compute skylines from the R-trees with additional optimizedtechniques. We call our algorithm F ast I tem R ecommendationSKY line (FIR-SKY).1

In this paper, we extend our algorithm to process skyline querieswith users’ criteria (e.g., $80rPricer$110). Then, by using BFSinstead of topological sorting, we improve the algorithm to a parallelalgorithm, which was not addressed in our previous work. The maincontributions of this paper are as follows:

SKY

con

onc

app

We extend FIR-SKY to process conditional skyline queries thatspecify users’ criteria.

� We also improve our sequential algorithm to process skyline

queries in a parallel manner. The parallelized algorithm is alsoprogressive.2

We show that our algorithm outperforms current state-of-the-art algorithms through a performance evaluation, and thespeedup factor of the parallelized algorithm is near linear.

The rest of the paper is organized as follows. Section 2 reviewsthe related work in further detail. Section 3 explains the details ofthe FIR-SKY algorithm and improvements of FIR-SKY. Section 4evaluates our algorithms in comparison to current state-of-the-art algorithms. Finally, Section 5 concludes the paper.

2. Related work

A considerable amount of work has been published on thetopic of algorithms that compute skyline queries. The previouswork can be grouped into two categories, namely non-index-based (e.g., block nested loop Borzsonyi et al., 2001, divide andconquer Borzsonyi et al., 2001) and sort-based skyline algorithmsBartolini et al., 2008; Godfrey et al., 2005; Chomicki et al., 2003)and index-based (e.g., the B-tree-based scheme Borzsonyi et al.,2001, R-tree based schemes Borzsonyi et al., 2001; Papadias et al.,2005, bitmap Tan et al., 2001, and index Tan et al., 2001). Typically,

1 In the preliminary version of this paper, our algorithm was called F ast I tem

line (FI-SKY). In this article, we rename our algorithm in order to represent the

tents of the paper clearly.2 Progressiveness means that interesting candidates are returned immediately

e they are identified. Thus, progressiveness is a very important property for

lications that need a fast initial response.

the index-based approaches outperform the non-index-basedapproaches. Index-based approaches can also guarantee to returnanswers progressively without scanning whole data sets. Thisproperty is called ‘‘progressiveness’’ (Tan et al., 2001).

As the number of attributes increases, the number of skylinepoints generally increases, and this might lead to a burden ofmatchmaking applications. To mitigate this problem, severaltechniques to identify representative skyline points such as theskyline frequency (Chan et al., 2006b), k-dominant skyline (Chanet al., 2006a) and strong skyline points (Zhang et al., 2005) havebeen proposed. Techniques to evaluate skylines in subspaces of agiven space (i.e., subsets of attributes) are proposed in Yuan et al.(2005) and Pei et al. (2005, 2006), and they consider skylines inmultiple subspaces simultaneously to identify better skylineitems. In Lee et al. (2008), the MaxPrune framework is proposedto minimize the number of preference elicitation steps for skylinecomputations over categorical attributes. Thus, owing to mini-mum preference elicitation from users the framework has a greatadvantage when the amount of users’ preference is not adequate.In Lofi et al. (2010), the trade-off or compromise concept forskyline computation is first proposed. For example, users arewilling to pay an additional $1000 for a car without an airconditioner to install a new air-conditioning. To support thetrade-off concept, several algorithms including new indexing anddominance-check algorithms are devised. Lattice Skyline

(LS) (Morse et al., 2007) transforms multiple low-cardinalitydomains into a lattice structure and computes the skyline basedon the structure of the lattice. However, the LS algorithm can onlyprocess a skyline query with only one unrestricted attribute.

In our previous work (Han et al., 2009), we proposed anefficient sequential skyline algorithm for data with multiple low-cardinality and unrestricted attributes. In Jung et al. (2010), weextended our skyline algorithm to process skyline queries forpartially-ordered attributes which generally have more complexstructures than low-cardinality attributes. Additionally, theiMinMax-based (Yu et al., 2004) dominance-check algorithm isalso proposed. In this paper, we extend our work (Han et al., 2009)to process skyline queries with additional users’ conditions anddevise a parallel skyline algorithm to exploit recent popular CPUarchitectures such as SMP and multi-core processors.

Skyline computation can be also applied to applications ofdifferent contexts. In Lee and Teng (2007), skyline computation isused to process multi-criteria rating for recommendation systems. InPersonal Preference Search Service (Abel et al., 2007), skylinecomputation with partially-ordered attributes is used to search thebest learning resources. However, no specific skyline algorithm isproposed in either research work. In Kodama et al. (2009), twospatial skyline algorithms based on the nearest neighbor search areproposed for mobile environments. The algorithms use distanceinformation, categorical attributes, and unrestricted attributes toreturn the best candidates considering users’ current position.

Branch-and-Bound Skyline (BBS) (Papadias et al., 2005), whichis based on the R-tree (Guttman, 1984), is IO-optimal andoutperforms other proposed algorithms (Kossmann et al., 2002;Borzsonyi et al., 2001). Per Algorithm 1, BBS traverses the R-tree

Page 3: An efficient skyline framework for matchmaking applications

i h Saction heap contents S

Access root <e7,4>,<e6 Ø>6,

Expand e <e 5> <e 6> <e 8> <e 10> ØExpand e7 <e3,5>,<e6,6>,<e5,8>,<e4,10> Ø

Expand e3 <i,5>,<e6,6>,<h,7>,<e5,8>,<e 10> <g 11>

{i}

<e4,10>,<g,11>

Expand e6 <h,7>,<e5,8>,<e1,9>,<e4,10>,<g,11> {i}

Expand e1 <a 10> <e 10> <g 11> <b 12> <c 12> {i,a}p d e1 <a,10>,<e4,10>,<g,11>,<b,12>,<c,12> { , }

Expand e4 <k,10>,<g,11>,<b,12>,<c,12>,<l,14> {i,a,k}

1098765432100 1 2 3 4 5 6 7 8 9 10 11

e6

e2e1 e4 e5e3

e7

a b c d e f g h i l k m n

Fig. 1. An example of BBS (Papadias et al., 2005): (a) R-tree and (b) heap contents ð/entry id; L1 distanceSÞ.

H. Han et al. / Journal of Network and Computer Applications 34 (2011) 102–115104

or its family, searches the nearest hyper-rectangles or points thatare not dominated by any point of the current skyline, and insertsthem into the MinHeap. If the top entry point of the MinHeap isnot dominated by any other point of the current skyline set, thetop entry point is inserted into the set, and the set is updated.Because BBS visits entries in ascending order of their distancesfrom the origin, each computed point is guaranteed to be a skylinepoint (progressiveness).

Fig. 1 gives an example of BBS with the assumption thatsmaller values are preferable to larger ones. As explained inPapadias et al. (2005), BBS fetches the root node of the R-tree thatis resident in a disk and inserts all its entries (e6, e7) in a heap. Theentries in the heap are sorted according to their L1 distance (thesum of x and y values). Then, the entry with the minimumdistance (e7) is ‘‘expanded’’. That is, the entry (e7) is removed andits children (e3, e4, e5) are inserted. The next expanded entry is e3

that has the first nearest neighbor (i) of the origin. Then, i belongsto the skyline, and is inserted into the list S of skyline points. If anentry (e.g., e2 and h) is dominated by any skyline point (e.g., (i)),the entry is pruned. Otherwise, BBS removes the entry, and insertsthe children of the entry (internal node) or list S (leaf node) after adominance-check operation.

Algorithm 1. Algorithm BBS.

Algorithm BBS(R, S)Input R is an R-tree.

S is an intermediate set of skyline points.Output Set of skyline points.

1: In

itialize heap H to empty; 2: In sert all entries in the root node of R� into heap H;

3: w

hile (H is not empty) do 4: Remove top entry e from H; 5: if (e is an internal entry) then 6: if (e is not dominated by any entry in S) then 7: for each child entry ei of e do 8: if (ei is not dominated by any entry in S) then 9: Insert ei into H; 10: endfor 11: else 12: S ¼ UpdateSkylines(e, S); 13: e ndwhile 14: r eturn S;

Algorithm UpdateSkylines(e, S)Input e is a data point in some leaf node of an R-tree.

S is an intermediate set of skyline points.

Output an updated set S.

1: f

or each pAS do 2: if (e is dominated by p) then 3: return S; 4: e ndfor 5: In sert e into S; 6: r eturn S;

There is also a lot of research work that focuses onmatchmaking engines for Web Services (OASIS, 2010; Graham,2001a, 2001b) and Grid Services (Globus Alliance, 2005). TheANSAware Trader (Consortium, 2010) and the CORBA/ODPTrading Service (ISO, 1995; OMG, 2000) can be regarded as anadvanced form of a directory. Consumers in the virtual market-place (Hoffner et al., 2000; Facciorusso et al., 2003) provide a

Page 4: An efficient skyline framework for matchmaking applications

20

10

1

2

3

1.20

2.20

3.20

1.0

2.10

3.10

Fig. 2. Example of low-cardinality domains: (a) multiple low-cardinality domains and (b) transformation into a lattice.

H. Han et al. / Journal of Network and Computer Applications 34 (2011) 102–115 105

description of what they want and select candidates fromproviders. These works do not deal with skyline computation asitem recommendation. If such matchmaking applications includeskyline computation, it would be better since they have itemrecommendation as an advanced feature.

In this paper, we focus on skyline computation for data withmultiple unrestricted and low-cardinality attributes, which hasnot been addressed before. To this end, we devise a new skylinealgorithm (FIR-SKY) to process skyline queries over such data byextending BBS.

3. FIR-SKY algorithm

Throughout this paper, and without loss of generality, weconsider the skyline with the min operator for all attributes.3 Thismeans that smaller values dominate larger ones. To this end, wemap the value T and F into 0 and 1 in the case of booleanattributes. In the case of a numeric attribute, if larger values arepreferable to smaller ones, we simply transform each value of theattribute into the maximum value of the attribute—the original

value. Otherwise, we do not need any transformation.We first explain the lattice-based indexing technique. Then, we

describe the FIR-SKY algorithm that computes the skyline from alltuples, and we explain the way that the FIR-SKY algorithmprocesses skyline queries with user-specified conditions. Finally,we extend the FIR-SKY algorithm to a parallelized algorithm.

3.1. Lattice-based indexing technique

Before indexing data, we first preprocess all low-cardinalityattributes using a lattice. If a relation Ri has n attributes(nu unrestricted attributes and nl low-cardinality attributes),low-cardinality attributes of the relation Ri are modeled byfA1,A2, . . . ,Anl

g. The unified domain of the lattice LV is thegenuinely Cartesian-product space. Then, we transform the unifieddomain into a lattice structure using a Hasse diagram, which is awell-known diagram to form a drawing of the transitive reductionof the partial order. As an example, Fig. 2 (a) and (b) show thattwo low-cardinality attributes are transformed into a two-dimensional lattice. Each low-cardinality attribute is totally-ordered (10 dominates 20) while the two-dimensional lattice is

3 In fact, BBS visits entries of an R-tree in an ascending order of their distance

due to the assumption that smaller values are preferable to larger ones.

partially-ordered (the lattice value {1,20} dominates {2,20},while {1,20} is incomparable to {2,10}). Then, we index datausing the lattice. The basic indexing structure is a set of R-trees,i.e., RT ¼ fRT l1,RT l2, . . . ,RTlng, such that RTl is built on data whoselattice value is l. Thus, when an item is stored, an appropriateR-tree is first identified according to low-cardinality values andonly unrestricted attribute values are stored to the identified R-tree. For example, when the record (1,10,3,3) is stored, the part(3,3) of the record the R-tree RT(1,10) with the assumption that thefirst and the second attributes are low-cardinality attributes.

The set of all possible lattice values LV is the Cartesian-product

space of all low-cardinality attributes. Thus, the number of R-treesthat is required when maintaining nl low-cardinality attributes isthe size of LV, NLV. Our stratification technique does not lead toredundant tuples across R-trees because the lattice value of eachtuple is unique. Our method, thanks to the disjointness propertyof our stratification technique, does not consume more amountsof space than the BBS algorithm, the current state-of-the-artalgorithm. Rather, the total space consumed by a set of R-trees inFIR-SKY is less than that of an R-tree in BBS because thedimension of each R-tree in FIR-SKY is not n but nu.

3.2. FIR-SKY: BFS-based-skyline-computation

Our algorithm is based on a hybrid data structure, a lattice whereeach node has values from multiple low-cardinality domains and aset of R-trees. All the R-trees have the same unrestricted attributes.Since the lattice structure (Fig. 2(b)) is a kind of directed acyclicgraph (DAG), we use a breadth-first search (BFS) (Fig. 3) todetermine the visit order of R-trees when data points are extractedfrom the R-trees for skyline computation.

The BFS order of a DAG is a linear ordering of its nodes, and thislinear ordering makes our algorithm progressive.

Property 1. BFS visits each node before all nodes to which it has

edges if a given DAG has the lattice structure.

By the definition of BFS, BFS treats neighbors of a vertex va assuccessors of va, where neighbors are pointed by outgoing edgesof va. BFS begins at the root node and explores all the neighboringnodes. Then for each of those nearest nodes, it explores theirunexplored neighbor nodes until all connected nodes are visited.From Figs. 2(b) and 3, we can verify this property.

Lemma 1. An intermediate skyline point px that is found in the RTi

cannot be dominated by any data point in RTj, where i precedes j in

the BFS order.

Page 5: An efficient skyline framework for matchmaking applications

x

RT(1.10)

x

y

x

y

RT(1.20)

x

y

RT(2.10)

y

RT(2.20)

x

y

RT(3.10)

x

y

RT(3.20)

Fig. 3. BFS order of example R-trees (lattice of Fig. 2(b)).

H. Han et al. / Journal of Network and Computer Applications 34 (2011) 102–115106

Proof. We prove the lemma by contradiction. Assume that anintermediate skyline point pxART i is dominated by a data pointpyART j. If px is dominated by py, i is dominated by j. By Property 1,this dominance implies that i node is reachable from j node in thelattice structure, and the reachability makes j precede i whenapplying the BFS. This contradicts the assumption, and we thencomplete the proof. &

Example 2 (Progressiveness). Any skyline points from RT{1,20} inFig. 3 cannot be dominated from any points from RTj, where j

(e.g., {2,10} and {3,20}) follows {1,20}. This is because {1,20}is incomparable to {2,10} or preferable to {3,20}. Thus, afterskyline computation on RT{1,20}, all skyline points from RT{1,20} canbe returned immediately to users.

The linear ordering property of the BFS enables optimizationsin a ‘‘dominance-check’’ operation. The ‘‘dominance-check’’operation of Algorithm 1 (line 6 and 8 of BBS, and line 2of UpdateSkylines) involves comparisons with all existingskyline points. In the FIR-SKY algorithm, the ‘‘dominance-check’’operation needs comparisons to only relevant skylinepoints. Relevant skyline points of the skyline computationon an R-tree RTi are all skyline points from R-trees i.e.,RTk,RTkþ1, . . . ,RTkþa,RT i, where k,k+1,y,k+a,i are ancestors of i

or i itself in the lattice structure. We called the optimizeddominance-check ‘‘Dominance-Check-Over-Relevant-Sets’’. Anotherimportant optimization is that the Dominance-Check-Over-Re-levant-Sets is based on nu-dimensional comparisons (not n-dimensional comparisons).

Example 3 (Dominance-Check-Over-Relevant-Sets). Skyline com-putation on RT{1,20} in Fig. 2 does not need comparisons to skylinepoints from RT{2,10} and RT{3,10} since {1,20} is incomparable to{2,10} and {3,10}. The relevant sets of the computation are setsof intermediate skyline points from RT{1,10} and RT{1,20}. Moreover,the comparison in the skyline computation on RT{1,20} involvesonly unrestricted attributes since it is already known thatf1,20g$f1,10g, {1,20}.

Algorithm 2. Algorithm Dominance-Check-Over-Relevant-Sets.

Data StructuresLV : lattice values from low-cardinality domains

Algorithm Dominance-Check-Over-Relevant-Sets ðe,S,vÞ

Input e is an entry or a data point of an R-tree.

S is a set of disjoint groups of intermediate skylinepoints

v is the lattice value of e

Output true if e is dominated by any intermediate skylinepoints

false if e is incomparable to all current skyline pointsS ¼ fSujuALVg is a set of disjoint groups of intermediateskyline points

1: RS ¼ fSujuALV ,SuAS,u¼ v or u¼ ancestor of vg 2: for each RSARS do 3: for each pARS do 4: if (e is dominated by p) then //nu-dimensional

comparison

5: return true; 6: endfor 7: endfor 8: return false;

Algorithm 2 describes ‘‘Dominance-Check-Over-Relevant-Sets’’.It derives relevant skyline sets from all intermediate skylinepoints (line 1), and then it returns true if any skyline pointdominates the entry of an R-tree (line 2–7). Otherwise, it returnsfalse (line 8).

After determining the selection order of the R-trees by usingthe BFS, FIR-SKY computes a skyline set using the FIR-BBSalgorithm. In each stage, FIR-SKY selects an R-tree RTi in the BFSorder, and performs the skyline computation for each RTi. ByLemma 1, the order for each RTi enables intermediate skylinepoints from each RTi to be returned to a user immediately.

Algorithm 3 sketches the FIR-SKY algorithm. The Scheduler

determines the BFS order of all lattice values, and theorder represents the selection order of R-trees. The i-th run ofFIR-BBS with RTi traverses the R-tree RTi, then searches forthe nearest internal entry or data point(s) that are notdominated by any points of current skyline. If such points arefound, they are inserted into the MinHeap H. If the top entrypoint of the MinHeap is not dominated by any other point ofthe current skyline set, the top entry point is inserted into the set,and the set is updated. FIR-BBS and UpdateSkylines usethe Dominance-Check-Over-Relevant-Sets in order to checkwhether an internal entry or a data point is dominated by otherpoints of the current skyline set, and this minimizes thecomparison time.

Page 6: An efficient skyline framework for matchmaking applications

H. Han et al. / Journal of Network and Computer Applications 34 (2011) 102–115 107

Theorem 1. The algorithm FIR-SKY achieves the IO-optimality and

progressiveness in skyline computation.4

Algorithm 3. Algorithm FIR-SKY.

sc

Data StructuresLV : lattice values from low-cardinality domains

Algorithm FIR�SKYðRT ÞInput RT ¼ fRTvjvALVg is a set of R-treesOutput A set of skyline points

S ¼ fSvjvALVg is a set of disjoint groups of intermediate skylinepoints

1:

4

an

G, LO¼ SchedulerðRT Þ;

2: for i ¼ 1 to NLV do 3: S ¼ FIR�BBSðG½i�,S,LO½i�Þ; 4: endfor 5: return S

Algorithm SchedulerðRT ÞInput RT ¼ fRTvjvALVg is a set of R-treesOutput G, LO: a sorted array of R-trees and lattice values viaBFS

1:

LO ¼ BFS(LV); 2: for i ¼1 to NLV do 3: G½i� ¼ RTv, where v¼ LO½i�ALV

4:

endfor 5: return G, LO

Algorithm FIR�BBSðRTv,S,vÞ

Input RTv is an R-tree for vALV

S is a set of disjoint groups of intermediate skylinepoints

v is the lattice id of the current iterationOutput S is an updated skyline

FIR-BBS is the same as the BBS algorithm in Algorithm 1,except that (i) each ‘‘dominated’’ comparison is replacedby a Dominance-Check-Over-Relevant-Sets operation, and(ii) UpdateSkylines is replaced by the new UpdateSkylinesbelow.

Algorithm UpdateSkylinesðe,S,vÞ

Input e is a data point in some leaf node of an R-tree.S is a set of disjoint groups of intermediate skyline

pointsv is the lattice id of e

Output an updated skyline set S.

1:

if ðDominance-Check-Over-Relevant-Setsðe,S,vÞ ¼ trueÞthen

2:

return S; 3: Insert e into Sv; 4: return S;

Proof. The main routine of FIR-BBS is the same as BBS that hasproven to be IO-optimal and progressive (Papadias et al., 2005).So, the progressiveness and the IO-optimality of FIR-BBS areinherited from those of BBS. By Lemma 1, there does not exist anintermediate skyline point that can be removed from existingskyline sets, and this guarantees the progressiveness of FIR-SKY.

The IO-optimality means that the skyline query processing requires only one

of whole tuples.

The BFS order guarantees an exact-one-invocation of FIR-BBS oneach R-tree RTi, and this prevents FIR-SKY from reading the samenode twice from the disk (IO-optimality). &

3.3. Conditional skyline query

Users may submit skyline queries with user-provided condi-tions. For example, users want to search interesting cars with lessthan 20000 miles of mileage and 20000 dollars. So, FIR-SKYshould process skyline computation with specific conditions.User-provided conditions can be grouped into two types: a donot-care condition and a qualifiable condition.

Don’t-care conditions mean that attributes that users want toignore are excluded from skyline computation (e.g., irrespective ofa mileage or a workout center). In the case of unrestrictedattributes, FIR-SKY computes L1 distance without the unrestrictedattributes. This does not violate the progressiveness and the IOoptimality. Additionally, we modify Dominance-Check-Over-

Relevant-Sets operation slightly. When Dominance-Check-Over-

Relevant-Sets examines an entry of an R-tree, it uses the onlyunrestricted attributes without don’t-care conditions. In the caseof low-cardinality attributes, FIR-SKY excludes the low-cardin-ality attributes when it performs BFS. Then, FIR-SKY inserts rootsof multiple trees that represent lattice values, and performs therest of the procedures. Fig. 4 gives an example. Since Attr. 3 isthe don’t-care condition, the lattice value {0,1} means twoR-trees for lattice values {0,1,0} and {0,1,1}. When FIR-SKYprocesses the lattice value {0,1}, FIR-SKY inserts roots of the twoR-trees and performs the FIR-BBS algorithm.

Qualifiable conditions mean that range conditions or exactmatches for specific attributes are included in skyline computation(e.g., with a parking lot and more than two stars). In the case ofunrestricted attributes, FIR-SKY handles qualifiable conditionsthrough slightly modified Dominance-Check-Over-Relevant-Sets andFIR-BBS operation. When Dominance-Check-Over-Relevant-Sets ex-amines an entry of an R-tree, it checks whether the (internal or leaf)entry satisfies qualifiable conditions as well as whether the entry isdominated. The FIR-BBS procedure does not also push any entry thatdoes not satisfy qualifiable conditions. In the case of low-cardinalityattributes, FIR-SKY simply processes BFS with the reflection ofqualifiable conditions, and performs the rest of the procedures. As anexample, a qualifiable condition might be given that the value ofAttr. 3should be 1in Fig. 4(a). Under the condition, the schedulingorder is f0,0,1g-f1,0,1g-f0,1,1g-f1,1,1g. Algorithm 4 givesdescription of our algorithm for conditional skyline queries withboth do not-care and qualifiable conditions.

3.4. Parallelization of FIR-SKY

If all attributes are independent, the expected number ofskyline points is s¼YððlnNÞd�1=ðd�1Þ!Þ (Buchta, 1989) where N

and d are the number of data items and attributes. As the data sizeand the data dimensionality increase, the skyline set sizeincreases quickly. This leads to more dominance check operationsand the skyline computation cost of FIR-SKY also increasesquickly. From Fig. 5, we can verify this.

Recently, many CPU vendors integrate multiple cores into asingle processor instead of increasing CPU clock to improve CPUperformance. Thus, in the multicore era, skyline computation canbe a good candidate for parallelization since skyline computationis computationally intensive in nature. To provide a more efficientquery processing framework, we devise a parallelized version ofFIR-SKY. The key idea of the parallelized version is the extensionof the scheduling procedure: (i) to stratify R-trees into severalgroups according to node depth of each R-tree and (ii) to process a

Page 7: An efficient skyline framework for matchmaking applications

1

0

1

0

1

0

Attr. 1

0.0

0.1 1.0

1.1Attr. 2 Attr. 3

Fig. 4. Example of do not-care low-cardinality attributes: (a) three low-cardinality attributes and (b) lattice based on Attr. 1 and Attr. 2.

0

30

60

90

120

4 5 6

Com

puta

tion

Tim

e

# of Unrestricted Attributes

0

30

60

90

120

1 2 4 8 16 32 64 128

Com

puta

tion

Tim

e

# of Data Points (x 10K)

Fig. 5. Computation time of FIR-SKY (seconds, uniform distribution, 5 low-cardinality attributes): (a) varying # of unrestricted attributes (1280k data items) and

(b) varying # of data items (6 unrestricted attributes).

H. Han et al. / Journal of Network and Computer Applications 34 (2011) 102–115108

skyline query over the R-trees of the same node depth in a parallelmanner.

Property 2. If any two nodes of a given lattice structure have the

same depth in the BFS tree of the lattice, both are incomparable to

each other.

Let us verify this property by contradiction. That is, we canassume that there exist two nodes of the lattice structure with thesame depth in the BFS tree but one of the two nodes is dominatedby the other. In a lattice structure, all nodes that have the samedepth of the BFS tree share at least one of same parent nodes.Since the two nodes are siblings, they are not reachable from eachother. This means that the two are not dominated by each other,and the contradiction is broken.

Algorithm 4. FIR-SKY algorithm for conditional skyline queries.

Data StructuresLV : lattice values from low-cardinality domains

Algorithm FIR�SKYðRT ÞInput RT ¼ fRTvjvALVg is a set of R-treesOutput A set of skyline points

S ¼ fSvjvALVg is a set of disjoint groups of intermediate skylinepoints

1: G

, LO¼ SchedulerðRT Þ; 2: f or i ¼ 1 to NLV do 3: S ¼ FIR�BBSðG½i�,S,LO½i�Þ;(in the case of do not-care conditions on unrestrictedattributes, the FIR-SKY algorithm inserts roots of multiple treesthat represent each lattice value.) 4: e ndfor 5: r eturn S

Algorithm SchedulerðRT ÞInput RT ¼ fRTvjvALVg is a set of R-treesOutput G, LO: a sorted array of R-trees and lattice values via BFS

Scheduler is the same as the Scheduler algorithm inAlgorithm 3, except that(i) in the case of do not-care conditions on unrestricted

attributes, the BFS function computes the order excludingthe attributes with do not-care conditions, and

(ii) in the case of qualifiable conditions on unrestrictedattributes, the BFS function computes the order by usingonly the attributes values that satisfy qualifiableconditions.

Algorithm FIR�BBSðRTv,S,vÞ

Input RTv is an R-tree for vALV

S is a set of disjoint groups of intermediate skylinepoints

v is the lattice id of the current iterationOutput S is an updated skylineFIR-BBS is the same as the FIR-BBS algorithm in Algorithm 3,except that(i) in the case of do not-care conditions on unrestricted

attributes, the algorithm computes L1 distance excludingunrestricted attributes, and

(ii) in the case of qualifiable conditions on unrestrictedattributes, the algorithm pushes an (internal or leaf) entrythat satisfies qualifiable conditions into the MinHeap.

Algorithm UpdateSkylinesðe,S,vÞ

Input e is a data point in some leaf node of an R-tree.S is a set of disjoint groups of intermediate skyline

pointsv is the lattice id of e

Page 8: An efficient skyline framework for matchmaking applications

Fig. 6. Example of the BFS tree and the classification: (a) example lattice; (b) the BFS tree of the example lattice; and (c) stratification by the node depth.

H. Han et al. / Journal of Network and Computer Applications 34 (2011) 102–115 109

Output an updated skyline set S.UpdateSkylines is the same as the UpdateSkylines algorithmin Algorithm 3.

Algorithm Dominance�Check�Over�Relevant�Setsðe,S,vÞ

Input e is an entry or a data point of an R-tree.S is a set of disjoint groups of intermediate skyline pointsv is the lattice value of e

Output true if e is dominated by any intermediate skyline pointsfalse if e is incomparable to all current skyline points

Dominance-Check-Over-Relevant-Sets is the same as theDominance-Check-Over-Relevant-Sets algorithm inAlgorithm 2, except that(i) in the case of do not-care conditions on unrestricted

attributes, the algorithm examines an entry of an R-treeexcluding unrestricted attributes, and

(ii) in the case of qualifiable conditions on unrestrictedattributes, the algorithm checks whether an (internal orleaf) entry satisfies qualifiable conditions as well aswhether the entry is dominated.

From Fig. 6(b), we can see that this property is correct. Forexample, the node depth of (2,20) and (3,10) is 2 and the bothnodes are incomparable to each other.

From this property, parallel invocation of the FIR-BBS procedurein Algorithm 3 is allowed.

Lemma 2. Any data point in RTi cannot be dominated by any data

point in RTj, where i and j nodes of a given lattice structure have the

same depth in the BFS tree.

Proof. We prove the lemma by contradiction. Assume that anintermediate skyline point pxART i is dominated by a data pointpyART j. If px is dominated by py, i is dominated by j. By Property 2,this dominance implies that the depth of i node is different fromthat of j node in the BFS tree. This contradicts the assumption, andwe then complete the proof. &

Example 4. In the scheduling order of Fig. 3, we can see that theskyline computation on RT{1,20} and RT{2,10} is parallelizable since(1,20)is incomparable to (2,10). And, the depth of the node(1,20)in Fig. 6 is the same as that of the node (2,10).

Algorithm 5. Parallel FIR-SKY algorithm.

Data StructuresLV : lattice values from low-cardinality domainsHT is the height of the BFS tree.

Page 9: An efficient skyline framework for matchmaking applications

Table 2Experimental parameters and values.

Parameter Values

# of unrestricted attributes 2–6

# of low-cardinality attributes 4–5

Data size (�10k) 1–128

H. Han et al. / Journal of Network and Computer Applications 34 (2011) 102–115110

Algorithm FIR�SKYðRT ÞInput RT ¼ fRTvjvALVg is a set of R-treesOutput A set of skyline points

S ¼ fSvjvALVg is a set of disjoint groups of intermediate skylinepoints

Lattice size (NLV) 625, 1024

1: G , LO¼ SchedulerðRT Þ; Attribute correlation Uniform, anti-correlated 2: f or i ¼ 1 to HT do Disk page size and Node size of an R-tree 4 kB 3: for all j in parallel do

4:

S ¼ FIR�BBSðG½i�½j�,S,LO½i�½j�Þ; 5: endfor 6: e ndfor 7: r eturn S

Algorithm SchedulerðRT ÞInput RT ¼ fRTvjvALVg is a set of R-treesOutput G, LO: a sorted array of stratified R-trees and latticevalues

1: L

O ¼ BFS_with_classification(LV); 2: f or i ¼ 1 to NLV do 3: hti ¼ the depth of LO[i] 4: Append RTi to G½hti�

5: e

ndfor 6: r

5 The sorting function of SaLSa is minC (minimum coordinate sort) since in

Bartolini et al. (2008) the minC sorting function is known to outperform other

sorting functions.6 In this article, IO time means time for R-tree or any index operations.

However, SaLSa does not have any index structure. Thus, we do not present IO

time for SaLSa. In our implementation for experiments, sorted data for SaLSa reside

in main memory.

eturn G, LO

FIR-BBS and Algorithm UpdateSkylines are the same as thosein Algorithm 3.

Based on Lemma 2, we modify the Scheduler and the FIR-SKY

procedures in Algorithm 3 to execute the FIR-BBS procedurein parallel. Algorithm 5 sketches the parallel FIR-SKY algorithm.The Scheduler stratifies the BFS tree according to the depth ofeach node. Fig. 6 (c) shows an example of such classification.The FIR-SKY invokes the FIR-BBS procedure to be executedconcurrently per each stratum. Note that the total number ofstrata is the same as the height of the BFS tree (see Fig. 6).

Theorem 2. The parallel FIR-SKY algorithm also achieves the

IO-optimality and progressiveness in skyline computation.

Proof. By Lemmas 1 and 2, there does not exist an intermediateskyline point that can be removed from existing skyline sets. Thisguarantees the progressiveness of the parallel FIR-SKY algorithm.Since the parallel FIR-SKY also executes exact-one-invocation ofFIR-BBS on each R-tree, the parallel FIR-SKY does not read thesame node twice from the disk. &

4. Performance evaluation

In this section, we give performance results detailing theefficiency and progressiveness of our algorithms. Section 4.1describes the experimental environment. Sections 4.2 and 4.3explain the details of experimental results. Section 4.4 discussesthe usability of our algorithms.

4.1. Experimental environment

The experiments were performed on a workstation with twoquad-core processors and 8 GB of main memory, which wasrunning the SUSE Linux operating system. We used synthetic datasets described in Borzsonyi et al. (2001). We have compared ouralgorithm to BBS and SaLSa (for Sort and Limit Skyline algorithm).BBS and SaLSa are known as the best index-based solution(Papadias et al., 2005) and the best sort-based algorithm (Bartolini

et al., 2008),5 respectively. BBS uses an R-tree to index the inputdata. When skyline computation is performed over the R-treeindex, BBS finds the nearest neighbor to origin, prunes searchspace and repeats find the nearest neighbor in unpruned space.SaLSa is a skyline algorithm that pre-sorts the input data in orderto effectively limit the number of tuples to be read and compared.SaLSa reads one record from the pre-sorted input data, verifies ifthe record can be included in the skyline set. All algorithms usedin the experiments have been implemented based on the Javaversion of the spatial index library (Hadjieleftheriou, 2010) andApache Axis (Apache Web Services Project, 2010). Table 2 showsthe parameters and values used in the experiments. It is notedthat throughout this section an unrestricted attribute and low-cardinality attribute is abbreviated to U and L, respectively.

In unrestricted attributes, values whose domain is (0.0, 1000.0]were generated by our synthetic data generator described inBorzsonyi et al. (2001). We conducted the experiments by varyingthe number of data, the number of attributes, and the structure ofa lattice. It is noted that due to similar behaviors we skip results ofconditional skyline queries and correlated distribution cases.

4.2. IO and computation efficiency

We evaluate the performance of FIR-SKY, measuring IO6 andcomputation time by varying the number of data points. We thenbreak the measured time down into IO and computation time.From Figs. 7 and 8, we can see that FIR-SKY outperforms BBS andSaLSa. Table 3 shows the number of skyline in both experiments.From the table, a larger number of data points generates moreskyline points and this leads to more computation time.

The dimension of R-tree in BBS is n, while that in FIR-SKY is nu.Thus, the difference of the dimensionality between BBS andFIR-SKY is 4 or 5 in cases of Figs. 7 and 8. The lattice-based indexingmethod mitigates the effect of ‘‘the curse of dimensionality’’problem, and leads to a lower height of R-trees in FIR-SKY thanBBS. The lower dimension and height of R-trees in FIR-SKY incurlower IO overhead than BBS. FIR-SKY improves IO performance by10 and 12 times in 4 and 5 unrestricted attributes cases,respectively.

Figs. 7 and 8 also show that FIR-SKY has shorter computationstime than BBS and SaLSa. FIR-SKY uses Dominance-Check-Over-

Relevant-Sets, and Dominance-Check-Over-Relevant-Sets is basedon comparisons to only relevant skyline points and nu-dimen-sional comparisons. The two main advantages of Dominance-

Check-Over-Relevant-Sets lead to the superiority of FIR-SKY. Due tosuperior computation efficiency, FIR-SKY computes skylines 7 and4 times (15 and 8 times) faster than BBS (SaLSa) in 4 and 5

Page 10: An efficient skyline framework for matchmaking applications

Fig. 7. Results of 4U & 4L and 5U & 4L cases (seconds, uniform distribution): (a) IO time (4U & 4L); (b) IO time (5U & 4L); (c) computation time (4U & 4L); and

(d) computation time (5U & 4L).

Fig. 8. Results of 2U & 4L (seconds, anti-correlated): (a) IO time and (b) computation time.

Table 3The size of skyline: (A) case of 4U & 4L, (B) case of 5U & 4L, and (C) case of 2U & 4L.

# of data points

(�10k)

1 2 4 8 16 32 64 128

A 1020 1436 2046 2838 3500 4511 5162 6111

B 1910 2691 3885 5450 7157 9778 12,328 16,440

C 1161 1684 2219 3011 4263 6211 8211 12,225

H. Han et al. / Journal of Network and Computer Applications 34 (2011) 102–115 111

unrestricted attributes cases in uniform distribution cases,respectively. In anti-correlated cases, FIR-SKY processes skylinequeries 7 times (36 times) faster than BBS (SaLSa).

From the observations, we can conclude that FIR-SKY com-putes the skyline much faster than BBS and SaLSa. Moreover,FIR-SKY is a more scalable algorithm than BBS and SaLSa sinceFIR-SKY shows stable performance even though the number oftuples increases. Note that FIR-SKY behaves similarly on proces-sing conditional skyline queries and processing skyline queriesover data sets with other distributions.

Discussion: Some might doubt the scalability of our approach; aset of R-trees in FIR-SKY appears to consume huge amounts ofspace. Our stratification technique does not lead to redundanttuples across trees because of the disjointedness property of thestratification. Rather, our approach does not store values of low-cardinality attributes since each R-tree in FIR-SKY is built for eachvalue of lattice values. The total space consumed by a set ofR-trees in FIR-SKY, therefore, is less than that of an R-tree in BBS.

In summary, FIR-SKY improves the IO and computationefficiency through the lattice-based indexing and optimizeddominance check technique. The lattice-based indexing techniqueachieves IO efficiency since values of low-cardinality domains arenot stored and this mitigates the effect of ‘‘the curse ofdimensionality’’ problem. The optimized dominance checktechnique improves the computation efficiency greatly becausethe dominance check operation involves relevant intermediateskyline points rather than all intermediate skyline points.

4.3. Other experiments

Effects of increasing low-cardinality attributes: Figs. 9 and 10show IO and computation time of FIR-SKY when the number oflow-cardinality attributes increases. As the number of low-cardinality attributes increases, the probability that a point willnot be dominated by existing skyline points also increases. Fromthe analysis in Buchta (1989), we can verify this from theoreticalviewpoints. Thus, the increase of low-cardinality attributes leadsto more skyline points as shown in Table 4. More skyline pointsmean more invocations of dominance check operations, and ouroptimized technique enables FIR-SKY to finish the skylinecomputation faster than BBS and SaLSa.

As the difference of the dimensionality between BBS and FIR-SKY increases, the difference in IO time between BBS and FIR-SKYincreases. From Figs. 9(a), (b), and 10(a), we can see a bigdifference in IO time between BBS and FIR-SKY in 4 and 5unrestricted attributes cases (13 and 18 times better than BBS),

Page 11: An efficient skyline framework for matchmaking applications

Fig. 9. Effects of increasing low-cardinality attributes (seconds, uniform): (a) IO time (4U & 5L); (b) IO time (5U & 5L); (c) computation time (4U & 5L); and (d) computation

time (5U & 5L).

Fig. 10. Effects of increasing low-cardinality attributes (2U & 5L, seconds, anti-correlated): (a) IO time and (b) computation time.

Table 4The size of skyline: (A) case of 4U & 5L, (B) case of 5U & 5L, and (C) case of 2U & 5L.

# of data points

(�10k)

1 2 4 8 16 32 64 128

A 1402 1699 2387 3445 4001 5433 6513 7996

B 2091 3320 4873 6570 8608 12,250 15,710 20,462

C 1386 1917 2698 3673 5142 7571 10,639 15,041

Table 5Pre-processing time of FIR-SKY (seconds).

4 Low-cardinalityattributes

5 Low-cardinalityattributes

Pre-processing time 2.145 3.981

H. Han et al. / Journal of Network and Computer Applications 34 (2011) 102–115112

and this difference is much bigger than that in Figs. 7(a), (b), and8(a). From Figs. 9(c) and (d), we can see that FIR-SKY computesskylines 10 and 11 times (25 and 21 times) faster than BBS (SaLSa)in 4 and 5 unrestricted attributes cases, respectively. In the anti-correlated case, FIR-SKY processes skyline queries 18 times(73 times) faster than BBS (SaLSa).

Table 5 shows pre-processing time of FIR-SKY. Since the pre-processing procedure in FIR-SKY is only BFS, the pre-processingtime of FIR-SKY lies within a small range.

Effects of increasing unrestricted attributes: Figs. 11 and 12, andTable 6 show the results when the number of unrestricted

attributes increases. A large number of unrestricted attributesincreases the size of a skyline set, due to the high probability thata data point will become a skyline point. In both cases of 4 and 5low-cardinality attributes, the difference in IO time between FIR-SKY and BBS gets larger (16–20 times in 4 and 5 low-cardinalityattributes) than that in the results of the previous section. FromFigs. 11(c) and (d), we can see that FIR-SKY computes skylines 4.5and 6.7 times (7 and 10 times) faster than BBS (SaLSa) in 4 and 5low-cardinality attributes cases, respectively. In the anti-corre-lated case, FIR-SKY processes skyline queries 16 times (21 times)faster than BBS (SaLSa).

It is noted that due to similar behaviors we report only resultsof uniform distribution in following experiments.

Progressiveness: Fig. 13 shows how fast each algorithm returnsall answers progressively. To this end, we recorded the time ittook to output each portion of the answers. In BBS and SaLSa, thetime slope rises more steeply than that of FIR-SKY because of lessefficient computation. In contrast, FIR-SKY shows more efficientbehavior due to the optimized dominance check. With thisobservation, we can validate the efficiency of FIR-SKY inprogressive skyline computation.

Effect of optimized dominance check: Fig. 14 shows thecomputation time of FIR-SKY with and without the optimizeddominance check operation. FIR-SKY+ and FIR-SKY- represent FIR-SKY implementations with and without ‘‘Dominance-Check-Over-Relevant-Sets’’ algorithm. In both figures, FIR-SKY+ outperformsFIR-SKY- by 9 and 12 times since FIR-SKY- have less pruningeffect than FIR-SKY+ . These results validate the excellence of

Page 12: An efficient skyline framework for matchmaking applications

Fig. 11. Effects of increasing unrestricted attributes (seconds, uniform distribution): (a) IO time (6U & 4L); (b) IO time (6U & 5L); (c) computation time (6U & 4L); and

(d) computation time (6U & 5L).

Fig. 12. Effects of increasing unrestricted attributes (3U & 4L, seconds, anti-correlated): (a) IO time and (b) computation time.

Table 6The size of skyline: (A) case of 6U & 4L, (B) case of 6U & 5L, and (C) 3U & 4L.

# of data points (�10k) 1 2 4 8 16 32 64 128

A 3157 4855 7992 11,984 14,743 21,369 29,210 40,365

B 3784 5992 9641 14,073 18,296 26,024 35,918 50,284

C 6338 10,810 17,930 29,157 47,165 76,175 120,811 191,232

Fig. 13. Progressiveness (# of tuples: 320k): (a) 5 unrestricted & 4 low-cardinality attributes and (b) 5 unrestricted & 5 low-cardinality attributes.

H. Han et al. / Journal of Network and Computer Applications 34 (2011) 102–115 113

‘‘Dominance-Check-Over-Relevant-Sets’’ and indicate that thepruning technique adopted in the dominance check operationworks efficiently.

Effect of parallelization: For this experiment, we adopt themaster-worker style to implement the parallelized FIR-SKY

algorithm. In each trial, a fixed number of threads are createdbeforehand. For a given depth in the BFS tree, all nodes with thesame depth are inserted into an atomic queue on whichoperations are atomic. Each thread consumes a queue item andexecutes skyline computation over the corresponding R-tree.

Page 13: An efficient skyline framework for matchmaking applications

Fig. 14. Effects of ‘‘Dominance-check-over-relevant-sets’’: (a) 5 unrestricted & 4 low-cardinality attributes and (b) 5 unrestricted & 5 low-cardinality attributes.

Fig. 15. Effects of parallelization (# of tuples: 320k).

H. Han et al. / Journal of Network and Computer Applications 34 (2011) 102–115114

All threads iteratively perform such procedure until the queue isempty. When the queue is empty and all threads are suspended,all nodes with the following depth are inserted into the queue andthe above procedure is repeated.

Fig. 15 shows the computation time of FIR-SKY with increasingthe number of threads. From the figure, we can see that ourparallelization technique based on the BFS with stratification isvery effective. In the ‘‘8 threads’’ case, the speedup factor (¼Ts/Tp)is 7.9 (nearly 8). Generally, when the number of low-cardinalitydomains increases, the number of lattice values per a stratumincreases, and this enables parallel invocation of FIR-BBS proce-dures. From this observation, even heavy skyline queries can beperformed efficiently by parallel FIR-SKY.

7 For example, the dominance region of a skyline point (px,py) is ðpx oxr1,

py oyr1Þ.

4.4. Discussion

It is certain that as the number of attributes or data itemsincreases the skyline size increases to the point of burdening itemrecommendation applications. In this sense, skyline computationmight be not appropriate for item recommendation due to a largeset of skyline points. To mitigate this problem (reduce the skylineset), various research work is performed to identify moreimportant skyline points. For example, the skyline frequency(Chan et al., 2006b), k-dominant skyline (Chan et al., 2006a), andstrong skyline (Zhang et al., 2005) algorithms identify moreimportant skyline points by counting in how many subspaceskylines each skyline item is present or by evaluating skylinequeries in subsets of attributes. In this sense, our work has ashortcoming since our algorithms cannot catch the most relevantskyline points by using various metrics. In fact, our work does notfocus on recommending the most relevant ones of skyline pointsto the user.

In Papadias et al. (2005), BBS supports K-dominating skylinequery, which retrieves the K points that dominate the largestnumber of other points. The basic idea of K-dominating skylinecomputation is to (i) first compute the skyline by using BBS and(ii) apply a query point in the R-tree for each skyline point p andcount the number of points falling inside the dominance region7

of p. It is noted that R-tree is well suitable for such countingoperations. We slightly modify the above procedure so thatFIR-SKY can process K-dominating skyline queries. For the firststep, FIR-SKY computes all skyline points. Then, all descendentR-trees are identified based on the lattice value of a given skylinepoint p, and the counting procedure is repeatedly performed onthe identified R-trees. For example, counting operations areexecuted over RT{1,20}, RT{2,20}, and RT{3,20} for a skyline pointfrom RT{1,20} in Fig. 2 (b). Thus, FIR-SKY can also support theK-dominating query.

Like the K-dominating query, FIR-SKY can help other skylinealgorithms such as k-dominant skyline and subspace skylinealgorithms which are generally computationally intensive. InYuan et al. (2005), if U � V for two different subsets of attributes,the skyline of U is included in the skyline of V with theassumption that any two data points do not have same value onthe same dimension. FIR-SKY first computes all skyline points forall attributes. Then, subspace skyline frameworks compute moreimportant points among the identified skyline points instead of allpoints. This reduces computation overhead of subspace skylinealgorithms owing to small input data. It is noted that the aboveprocedure is not valid if the number of attributes is extremelylarge to the extent that all points are skyline points.

To increase users’ satisfaction, it is important to supportvarious orderings of low-cardinality attributes. In comparisonwith BBS and SaLSa, FIR-SKY has an advantage when orderrelationship of low-cardinality attributes changes. For example,let us assume that at first ‘‘blue’’ and ‘‘red’’ are mapped into 0 and1, respectively. This means that a user prefer blue items to reditems. Then, the user’s preference is changed inversely later. Insuch cases, FIR-SKY only re-constructs the lattice structure toreflect the changed relationship and rest of skyline computationcan be processed without any changes. However, all data itemsshould be re-indexed and re-sorted in BBS and SaLSa, and thisleads to large overhead.

We evaluate this property by conducting a simple experiment.For this experiment, we changed the order relation of a given low-cardinality attribute randomly during skyline computation. Sincethe BBS and SaLSa algorithms cannot adapt to such changewithout rebuilding whole R-tree and resorting all data items, theyare allowed to compute a skyline set only with the previous

Page 14: An efficient skyline framework for matchmaking applications

H. Han et al. / Journal of Network and Computer Applications 34 (2011) 102–115 115

structure. In contrast to BBS and SaLSa, FIR-SKY can compute askyline set simply by re-computing the BFS scheduling order foraffected values. We ran FIR-SKY three times, and the sizes ofskyline sets in all trials were 4524, 4619, and 4567. Additionally,the counts of the skyline points that BBS and SaLSa could notdetect were 295, 317, and 128, respectively.

Similarly, FIR-SKY can support personalized preference of low-cardinality attributes for skyline computation in that FIR-SKYconstructs the lattice structure based on the preference of eachuser. However, it is non-trivial to support personalized preferencein BBS and SaLSa. We believe that these properties are importantsince user’s preference may change many times and users’preference may differ from other’s preference in the itemrecommendation context.

Based on our observation, FIR-SKY has both advantages anddisadvantages in terms of user satisfaction. FIR-SKY cannot limitskyline points by using various metrics, and this might lead toburden of item recommendation applications. However, FIR-SKYcan support the K-dominating skyline query. Moreover, FIR-SKYsupports various users’ preferences without large computationaloverhead.

5. Conclusion

We have presented the FIR-SKY algorithm for item matchmak-ing applications. In this study, items are modeled on data withmultiple low-cardinality and unrestricted attributes. To computea skyline query efficiently, FIR-SKY adopts lattice-based indexing,and uses BFS to exploit the data characteristics. The lattice-basedindexing method incurs low IO overhead, and the BFS schedulingguarantees progressiveness. By the indexing and the scheduling,we design an optimized dominance-check (Dominance-Check-

Over-Relevant-Sets) that enables the FIR-SKY algorithm to havemuch better performance than state-of-the-art skyline algo-rithms. Then, we extend our algorithm to process conditionalqueries and improve the algorithm to execute FIR-BBS in parallel.

Acknowledgement

This work was supported by the National Research Foundation(NRF) grant funded by the Korea government (MEST) (No. 2010-0014387). The ICT at Seoul National University provided researchfacilities for this study.

References

Abel F, Herder E, Olmedilla D, Siberski W. Exploiting preference queries forsearching learning resources. In: Lecture notes in computer science; 2007.

Apache web services project. WebServices axis: /http://ws.apache.org/axis/S;2010.

Bartolini I, Ciaccia P, Patella M. Efficient sort-based skyline evaluation, vol. 33,2008.

Borzsonyi S, Kossmann D, Stocker K. The skyline operator. In: Proceedings of the17th international conference on data engineering, 2001.

Buchta C. On the average number of maxima in a set of vectors. Inf Process Lett1989;33(2).

Chan CY, Jagadish H, Tan KL, Tung A, Zhang Z. Finding k-dominant skylines in highdimensional space. In: SIGMOD ’06: proceedings of the 2006 ACM SIGMODinternational conference on management of data, 2006a.

Chan CY, Jagadish H, Tan KL, Tung A, Zhang Z. On high dimensional skylines. In:Proceedings of the 10th international conference on extending databasetechnology, 2006b.

Chomicki J, Godfrey P, Gryz J, Liang D. Skyline with presorting. In: Proceedings ofthe 19th international conference on data engineering, 2003.

Consortium A. ANSAware: /http://www.ansa.co.uk/S; 2010.Facciorusso C, Field S, Hauser R, Hoffner Y, Humbel R, Pawlitzek R, et al. A web

services matchmaking engine for web services. In: International conference onelectronic commerce and web technologies, 2003.

Globus Alliance. GT Information Services: Monitoring & Discovery System (MDS);2005.

Godfrey P, Shipley R, Gryz J. Maximal vector computation in large data sets. In:VLDB ’05: proceedings of the 31st international conference on very large databases, 2005.

Graham S. The role of private UDDI nodes in web services, part 1: six species ofUDDI. IBM Emerging Internet Technologies; 2001a.

Graham S. The role of private UDDI nodes in web services, part 2: nodes andoperator nodes. IBM Emerging Internet Technologies; 2001b.

Guttman A. R-trees: a dynamic index structure for spatial searching. In: SIGMOD’84: proceedings of the 1984 ACM SIGMOD international conference onmanagement of data, 1984.

Hadjieleftheriou M. Spatial Index Library /http://www2.research.att.com/marioh/spatialindexS; 2010.

Han H, Jung H, Kim S, Yeom HY. A skyline approach to the matchmaking webservice. In: Ninth IEEE/ACM international symposium on cluster computingand the grid, 2009.

Hoffner Y, Facciorusso C, Field S, Schade A. Distribution issues in the design andimplementation of a virtual market place. Computer Networks 2000;32(6).

ISO. Open distributed processing reference model; 1995.Jung H, Han H, Yeom HY, Kang S. A fast and progressive algorithm for skyline

queries with totally- and partially-ordered domains. Journal of Systems andSoftware 2010;83(3).

Kodama K, Iijima Y, Guo X, Ishikawa Y. Skyline queries based on user locations andpreferences for making location-based recommendations. In: LBSN ’09:proceedings of the 2009 international workshop on location based socialnetworks, 2009.

Kossmann D, Ramsak F, Rost S. Shooting stars in the sky: an online algorithm forskyline queries. In: VLDB ’02: proceedings of the 28th international conferenceon very large data bases, 2002.

Lee H-H, Teng W-G. Incorporating multi-criteria ratings in recommendationsystems. In: Proceedings of the 2007 IEEE international conference oninformation reuse and integration, 2007.

Lee J, won You G, won Hwang S, Selke J, Balke W-T. Optimal preference elicitationfor skyline queries over categorical domains. In: Proceedings of the 19thinternational conference on database and expert systems applications, 2008.

Lofi C, Guntzer U, Balke W-T. Efficient computation of trade-off skylines. In:Proceedings of the 13th international conference on extending databasetechnology, 2010.

Morse M, Patel J, Jagadish H. Efficient skyline computation over low-cardinalitydomains. In: VLDB ’07: proceedings of the 33rd international conference onvery large data bases, 2007.

OASIS. The UDDI Version 3 Specification /http://uddi.xml.orgS; 2010.OMG. CORBA trading object service; 2000.Papadias D, Tao Y, Fu G, Seeger B. Progressive skyline computation in database

systems. ACM Trans Database Syst 2005;30(1).Pei J, Jin W, Easter M, Tao Y. Catching the best view of skyline: a semantic approach

based on decisive subspaces. In: VLDB ’05: proceedings of the 31stinternational conference on very large data bases, 2005.

Pei J, Yuan Y, Lin X, Jin W, Ester M, Liu Q, et al. Towards multidimensionalsubspace skyline analysis. ACM Trans Database Syst 2006;31(4).

Tan KL, Eng PK, Ooi BC. Efficient progressive skyline computation. In: VLDB ’01:proceedings of the 27th international conference on very large data bases,2001.

Yu C, Bressan S, Ooi BC, Tan K-L. Querying high-dimensional data in single-dimensional space. VLDB J 2004;13(2).

Yuan Y, Lin X, Liu Q, Wang W, Yu JX, Zhang Q. Efficient computation of the skylinecube. In: VLDB ’05: proceedings of the 31st international conference on verylarge data bases, 2005.

Zhang Z, Guo X, Lu H, Tung A, Wang N. Discovering strong skyline points in highdimensional spaces. In: Proceedings of the 14th ACM international conferenceon information and knowledge management, 2005.