Hong Kong Baptist University
DOCTORAL THESIS
Efficient group queries in location-based social networks
Li, Yafei
Date of Award: 2015
General rights
Copyright and intellectual property rights for the publications made accessible in HKBU Scholars are retained by the authors and/or other copyright owners. In addition to the restrictions prescribed by the Copyright Ordinance of Hong Kong, all users and readers must also observe the following terms of use:
• Users may download and print one copy of any publication from HKBU Scholars for the purpose of private study or research
• Users cannot further distribute the material or use it for any profit-making activity or commercial gain
• To share publications in HKBU Scholars with others, users are welcome to freely distribute the permanent URL assigned to the publication
Download date: 11 Jun, 2022
Efficient Group Queries in Location-based Social Networks
Yafei LI
A thesis submitted in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
Principal Supervisor: Professor Jianliang XU
Hong Kong Baptist University
June 2015
Declaration
I declare that this thesis has been composed by myself under the guidance of my principal
supervisor, Professor Jianliang XU. The thesis has not previously been included in any thesis,
dissertation or report submitted to any institution for a degree, diploma or other
qualification. All sources of information have been acknowledged by means of references to the
relevant publications.
Signature:
Date: June 2015
Abstract
Nowadays, with the rapid development of GPS-equipped mobile devices, location-based
social networks have emerged to bridge the gap between the physical world and
online social networking services. Various types of data, such as personal locations,
check-ins, microblogs and social relations, have become available in location-based social
networks. Efficiently managing and analyzing such data to meet users’ daily query
requirements has become a challenging task. Among the existing lines of work on location-based
social networks, group queries are one of the most important research topics. In this thesis,
we investigate query techniques for location-based services in social networking
applications. Specifically, considering a location-based social network, we study spatial-aware
interest group queries, geo-social k-cover group queries, and social-aware ridesharing
group queries.
Firstly, we study the spatial-aware interest group queries in location-based social
networks. Recently, most location-based social networks have released check-in services that
allow users to share the locations they visit with their friends. These locations, considered
as spatial objects, are usually associated with a few tags that describe the features of those
locations. Utilizing such information, we propose a new type of Spatial-aware Interest
Group (SIG) query that retrieves a user group of size k where each user is interested in the
query keywords and the users are close to each other in the Euclidean space. We prove that this
query problem is NP-complete, and develop two efficient algorithms, IOAIR and DOAIR,
based on the IR-tree for processing SIG queries. We also validate the
efficiency of the proposed query processing algorithms by empirical evaluation.
Secondly, we study the problem of geo-social k-cover group queries for collaborative
spatial computing. In this problem, we propose a novel type of geo-social query, called the
Geo-Social K-Cover Group (GSKCG) query, which is based on spatial containment and
a new modeling of social relationships. Intuitively, given a set of spatial query points
and an underlying social network, a GSKCG query finds a minimum user group in which
the members satisfy a certain social relationship and their associated regions jointly
cover all the query points. Despite its practical usefulness, the GSKCG query problem
is NP-complete. We consequently explore a set of effective pruning strategies to derive
an efficient algorithm for finding the optimal solution. Moreover, we design a novel
index structure tailored to our problem to further accelerate query processing. Extensive
experiments demonstrate that our algorithm achieves desirable performance on real-life
datasets.
Thirdly, we study the problem of social-aware ridesharing group queries. With the
deep penetration of smartphones and geo-locating devices, ridesharing is envisioned as a
promising solution to transportation-related problems such as congestion and air pollution
in metropolitan cities. Despite its potential to provide significant societal and
environmental benefits, ridesharing has not so far been as popular as expected. Notable barriers
include social discomfort and safety concerns when traveling with strangers. To
overcome these barriers, in this thesis, we propose a new type of Social-aware Ridesharing
Group (SaRG) query, which retrieves a group of riders by taking into account their social
connections in addition to traditional spatial proximity. Because the SaRG query problem is
NP-hard, we design an efficient algorithm with a set of powerful pruning techniques to
tackle this problem. We also present several incremental strategies that accelerate the search
by reducing repeated computations. Moreover, we propose a novel index
tailored to the proposed problem to further speed up query processing. Experimental
results on real datasets show that our proposed algorithms achieve desirable performance.
The work in this thesis shows that the group query processing techniques are effective,
which would facilitate the wider deployment of such query services in real applications.
Keywords: Location-based services, Query processing, Group queries, Indexing, Spatial
database, Location-based social networks, Social constraints, Ridesharing.
Acknowledgements
I would like to express my deep gratitude to my principal supervisor, Prof. Jianliang XU,
for his great patience, inspiring guidance and constructive suggestions on my studies and
research work over these years. He brought me into this challenging research area
and shared insightful experiences with me. I would like to thank my co-supervisor, Dr.
Weifeng SU, for his continuous encouragement and support in life. I would also like
to thank other supervisors, Dr. Rui CHEN, Dr. Haibo HU, and Dr. Byron CHOI, for their
good suggestions on my research studies.
I would like to thank my colleagues for their direct and indirect help. In particular, I
should mention Mr. Lei CHEN, Mr. Qian CHEN, Mr. Cheng XU, Mr. Zhe FAN, Mr.
Peipei YI, Dr. Xin LIN, Dr. Qijun ZHU, Dr. Dingming WU, Dr. Yun PENG, among
many others.
Finally, I take this special occasion to thank my father Guoming LI and my mother
Xiuqin GAO for raising and supporting me for so many years. I also wish to thank my
dear Yanli ZENG and other family members for their full understanding and support
over these years. Without them, I would never have come so far.
Table of Contents
Declaration
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Spatial-aware Interest Group Queries
1.2 Geo-Social K-Cover Group Queries
1.3 Social-aware Ridesharing Group Queries
1.4 Thesis Organization
Chapter 2 Related Works
2.1 Spatial query processing
2.2 Social query processing
2.3 Spatial keyword query processing
2.4 Geo-Social query processing
2.5 Ridesharing query processing
Chapter 3 Spatial-aware Interest Group Queries in Location-based Social Networks
3.1 Problem Definition
3.2 Proposed Approaches
3.2.1 Preliminary: IR-Tree
3.2.2 Overview
3.2.3 Interest Oriented Algorithm
3.2.4 Diameter Oriented Algorithm
3.3 Performance Evaluation
3.3.1 Datasets and Queries
3.3.2 Setup
3.3.3 Experimental Results
3.4 Summary
Chapter 4 Geo-Social K-Cover Group Queries for Collaborative Spatial Computing
4.1 Problem Formulation
4.2 Algorithm Design
4.2.1 Basic Algorithm
4.2.2 Basic Pruning
4.2.3 Diameter Based Pruning
4.2.4 Access Order Based Pruning
4.3 Hybrid Indexing
4.3.1 SaR-tree
4.3.2 Enhanced SaR-tree
4.3.3 GSKCG Query Processing
4.4 Performance Evaluation
4.4.1 Datasets and Queries
4.4.2 Setup
4.4.3 Experimental Results
4.5 Summary
Chapter 5 Towards Social-aware Ridesharing Group Query Services
5.1 Problem Formulation
5.2 Algorithm Design
5.2.1 RSGExplorer Algorithm
5.2.2 Quota Available Strategy
5.2.3 Group Diameter Strategy
5.2.4 k-plex Based Strategy
5.2.5 Query Processing
5.3 Incremental Strategies
5.3.1 Incremental Computation of Core Numbers
5.3.2 Social Diameter-based Bounding
5.3.3 Neighbor-based Bounding
5.4 Hybrid Index
5.4.1 SIR-tree
5.4.2 Query Processing
5.5 Performance Evaluation
5.5.1 Experimental Settings
5.5.2 Experimental Results
5.6 Summary
Chapter 6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Work
Bibliography
Curriculum Vitae
List of Tables
3.1 Summary of notations
3.2 Example Interest Vector
3.3 Dataset Properties
4.1 Summary of notations
4.2 Dataset properties
5.1 Summary of notations
5.2 Survey results (216 participants)
5.3 Access indexes of users in Figure 5.4
5.4 Dataset properties
List of Figures
1.1 A framework of the social-aware ridesharing system
1.2 An example of Slugging
3.1 An example of SIG query
3.2 Tree Index Structure
3.3 Example of Theorem 3.4
3.4 Distance between u1 and its neighbors
3.5 Constructing G4(u1, u11)
3.6 Varying k
3.7 Varying α
3.8 Varying k on Dianping (α = 0.9)
3.9 Varying the number of query tags
3.10 Varying Buffer Size
3.11 Varying the Number of Users
4.1 An example of a location-based social network for GSKCG query
4.2 Branch and bound search tree
4.3 Sorted user list ListP
4.4 A sample SaR-tree
4.5 Example of CBRs in an SaR-tree
4.6 A sample LBSN for constructing CBR
4.7 Constructing user u’s internal CBRs
4.8 Constructing a user u’s external CBRs
4.9 Running time vs. k value
4.10 Running time vs. number of query points
4.11 Running time vs. query point coverage
4.12 Running time under multiple familiar regions
4.13 Pruning capabilities of different schemes
4.14 Size of query results
4.15 Running time vs. network size
4.16 Quality comparison of the returned groups
5.1 Numbers of potential social groups of size 5
5.2 An example of an SaRG query
5.3 Branch and bound search tree
5.4 An example of SaRG query
5.5 An example of SIR-tree
5.6 Running time vs. group size
5.7 Running time vs. k value
5.8 Running time vs. the number of riders
5.9 Pruning abilities of different schemes
5.10 Travel cost vs. k or s
Chapter 1
Introduction
With the rapid development of location-aware mobile devices, ubiquitous Internet access
and social computing technologies, a large volume of users’ personal data, such as
locations, check-ins, microblogs, tweets, and social connections, has emerged and become
readily accessible from various location-based social networks (e.g., Twitter,
Facebook). Moreover, the amount of such data is growing explosively. For example, by
July 2014, the number of Twitter users had reached 500 million and the average
number of tweets per day had exceeded 58 million;¹ the total number of monthly active
Facebook users had reached 1.3 billion and the average number of messages
sent on Facebook every 20 minutes had exceeded 640 million.² Hence, how to efficiently
manage these data to satisfy users’ daily query requirements is a crucial task. Among all
the existing studies on location-based social networks, group queries have a number
of practical applications (e.g., activity planning [37, 68], product promotion [40], travel
recommendation [60, 61], ridesharing [16, 44]). In this thesis, we study group query
techniques for location-based services in location-based social networks.
¹ http://www.statisticbrain.com/twitter-statistics/
² http://www.statisticbrain.com/facebook-statistics/
A group query is issued to a database when a user wants to find a set of objects or
users satisfying some required query constraints. Specifically, consider a spatial database
in which each object is coupled with several tags that indicate its features. A group query
usually takes as input a set of query keywords and a query point, and returns a set of spatial
objects that can fully or collaboratively cover all the query keywords and that are close to the
query point. Given a social network, a typical group query requests a set of users in which
the connections among them satisfy some specific social constraints (e.g., minimum
acquaintance). Currently, location-based social networks, such as Foursquare and Facebook
Places, are bridging the gap between the physical world and online social networking
services through acquired user locations. User-generated data from these location-based
social networks usually mixes more than one data type (e.g., check-ins, tweets or
microblogs coupled with locations, social relations, and trajectories). Group queries are
consequently evolving into novel forms over such data. Therefore, efficient query
processing techniques need to be developed to tackle these newly generated group query
problems.
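The keyword-covering group query described above can be sketched by brute force. The following is only a minimal illustration, not an algorithm from this thesis: the function name `collective_keyword_query`, the cost (sum of Euclidean distances to the query point), the `max_size` cap, and the toy data are all assumptions made for the example.

```python
from itertools import combinations
from math import dist

def collective_keyword_query(objects, query_point, query_keywords, max_size=3):
    """Brute-force sketch: cheapest object set whose tags cover all query keywords.

    objects maps a name to ((x, y), set_of_tags). The cost (sum of distances to
    the query point) and the max_size cap are illustrative assumptions.
    """
    wanted = set(query_keywords)
    best, best_cost = None, float("inf")
    for size in range(1, max_size + 1):
        for group in combinations(sorted(objects), size):
            covered = set().union(*(objects[name][1] for name in group))
            if not wanted <= covered:
                continue  # some query keyword is not covered by this group
            cost = sum(dist(objects[name][0], query_point) for name in group)
            if cost < best_cost:
                best, best_cost = group, cost
    return best, best_cost

# Hypothetical toy data: three tagged spatial objects.
objects = {
    "cafe": ((1, 0), {"coffee", "food"}),
    "cinema": ((0, 2), {"movie"}),
    "mall": ((5, 5), {"coffee", "movie", "food"}),
}
group, cost = collective_keyword_query(objects, (0, 0), {"coffee", "movie"})
print(group, cost)
```

In this toy instance the two nearby objects jointly cover the keywords at a lower total distance than the single distant object that covers them alone, which is the "collaboratively cover" case mentioned above. The exhaustive enumeration is exponential in the group size and is meant only to make the query semantics concrete.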
A few recent works have investigated query techniques for group queries in
location-based social networks [10,39,40,42,44,45,50,67–69,72]. The representative
studies on geo-textual group queries [10,42] consider collective spatial keyword queries
based on the objects’ spatial distance and keyword coverage. However, they do not
consider social factors, such as users’ interests reflected by their check-ins at these spatial
objects. Few studies find a group by considering both the group members’
interests and their spatial distances. While the studies [39, 67, 68] consider geo-social
group queries, which intend to find a group of attendees satisfying the given social and
spatial distance constraints, these queries do not fully exploit the new search possibilities
brought by social computing technologies. For example, finding a group of collaborative
workers, whose service regions can jointly cover the given spatial tasks, is an important
geo-social query problem for collaborative spatial computing. However, such practically
useful queries on spatial containment and social relations have not been covered by
existing works. Besides, another type of group query, namely the ridesharing group
query, is regarded as a promising approach to resolving transportation-related problems
in metropolitan cities, such as traffic congestion and air pollution. However, existing
works [44, 45, 50, 69, 72] from both industry and academia focus only on the coordination
of ridesharing trips and schedules. They do not consider the trust issue in forming
ridesharing groups, which may make ridesharing unsafe and uncomfortable.
In this thesis, we propose several novel group queries for location-based social
networks to fill the research gap stated above. Specifically, we have selected three
representative group queries, namely, spatial-aware interest group queries, geo-social k-cover
group queries, and social-aware ridesharing group queries. The main challenge of
processing the proposed group queries lies in the hardness of these problems. We prove that
the proposed group queries are NP-complete or NP-hard problems. Therefore, designing
efficient algorithms to tackle these hard problems requires non-trivial efforts. In this
thesis, we develop several efficient query processing algorithms with a number of pruning
strategies. We also design several efficient index structures to accelerate the search.
1.1 Spatial-aware Interest Group Queries
The first part of this thesis focuses on efficient spatial-aware interest group queries in
location-based social networks. Currently, most location-based social networks have
released check-in services that allow users to share the locations they visit with their friends.
These locations, considered as spatial objects, are usually associated with a few tags that
describe the features of those locations, e.g., spatial object ‘starbucks’ with tags ‘food’,
‘beverage’, and ‘coffee’. If a user checks in at the spatial object ‘starbucks’, the user may be
interested in ‘food’, ‘beverage’, or ‘coffee’. These voluntary check-in actions reflecting
the users’ interests can benefit many applications. Utilizing such information, Chapter 3
proposes a new type of Spatial-aware Interest Group (SIG) query that retrieves a user
group of size k where each user is interested in the query keywords and the users are close
to each other in the Euclidean space.
Existing works in the literature have considered group queries in location-based social
networks. The works [41,68] aim at finding a group of attendees close to a rally point and ensure that
the selected attendees have good social relationships to create a good atmosphere in the
activity. [67] aims to find the activity time and attendees with the minimum total social
distance to the initiator. [37, 38] explore a group of experts whose skills can cover all the
requirements and among whom the communication cost is low. Different from
existing work, the SIG query retrieves a user group of size k that maximizes a ranking
function combining the diameter of the group (i.e., the distance between the farthest pair
of users) and the group’s interest in the query keywords.
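To make the SIG query semantics concrete, the following is a minimal brute-force sketch. The actual ranking function is defined in Chapter 3 and is not reproduced here; the weighted form below (a parameter `alpha` trading off average interest against a normalized diameter), the per-user interest scores in [0, 1], and the names `diameter` and `sig_query` are assumptions made only for illustration.

```python
from itertools import combinations
from math import dist

def diameter(points):
    """Largest pairwise Euclidean distance (0 for fewer than two points)."""
    return max((dist(p, q) for p, q in combinations(points, 2)), default=0.0)

def sig_query(users, keywords, k, alpha=0.5):
    """Brute-force sketch of a SIG-style query.

    users maps a name to ((x, y), {tag: interest in [0, 1]}).
    Assumed score: alpha * average interest + (1 - alpha) * (1 - normalized diameter).
    """
    norm = diameter([loc for loc, _ in users.values()]) or 1.0

    def score(group):
        interest = sum(users[u][1].get(t, 0.0) for u in group for t in keywords)
        interest /= len(group) * len(keywords)
        closeness = 1.0 - diameter([users[u][0] for u in group]) / norm
        return alpha * interest + (1 - alpha) * closeness

    return max(combinations(sorted(users), k), key=score)

# Hypothetical toy data: four users with an interest score for 'movie'.
users = {
    "u1": ((0, 0), {"movie": 0.9}),
    "u2": ((1, 0), {"movie": 0.8}),
    "u3": ((10, 0), {"movie": 1.0}),
    "u4": ((0, 1), {"movie": 0.1}),
}
balanced = sig_query(users, ["movie"], k=2, alpha=0.5)
interest_only = sig_query(users, ["movie"], k=2, alpha=1.0)
print(balanced, interest_only)
```

With equal weights the two nearby, interested users are preferred; with `alpha = 1.0` the distant but most interested user enters the group, which shows how the two criteria pull in different directions. Enumerating all size-k groups is exponential in k, consistent with the hardness of the problem noted earlier.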
SIG queries are useful in many scenarios. For example, consider a company that wants
to hold promotion campaigns in some regions. The company is interested in identifying
the regions containing potential customers who are interested in the features (query
keywords) of the product promoted. Another example is interest-based group gathering.
Query keyword ‘movie’ may find a group of nearby people who are movie lovers, while
query keyword ‘NBA’ could retrieve a group of nearby people who like playing basketball.
Note that the group size in these queries is usually constrained due to limited venue
capacity and/or financial budget. Chapter 3 is dedicated to efficient query processing
techniques for this kind of query.
1.2 Geo-Social K-Cover Group Queries
The second part of this thesis is devoted to efficient geo-social k-cover group queries
for collaborative spatial computing. The convergence of location data and social data
has enabled a new computing paradigm that explicitly combines both location and social
factors to generate useful computational results for either business or social good. We
use the term collaborative spatial computing to represent this emerging paradigm. The
idea of collaborative spatial computing has been widely used in various domains. One
of the most important applications of collaborative spatial computing in location-based
social networks is geo-social queries, which are attracting increasing interest from both
industrial and academic communities.
The study of geo-social queries is in its infancy. The pioneering studies [25,39,41,
68] typically consider geo-social queries that take as inputs a set of mobile users, a query
location point and a certain social acquaintance constraint, and that return a set of users with
the minimum location distance while satisfying the social constraint. While useful
in some applications (e.g., activity planning), these queries do not fully exploit the new search
possibilities brought by geo-social data. In Chapter 4, we propose a novel type of geo-social
query, called the Geo-Social K-Cover Group (GSKCG) query, which is based on
spatial containment and a new modeling of social relationships. Intuitively, given a set of
spatial query points and an underlying social network, a GSKCG query finds a minimum
user group in which the members satisfy a certain social relationship and their associated
regions jointly cover all the query points.
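The intuition behind a GSKCG query can be sketched by exhaustive search over groups of increasing size. This is an illustration only, not the algorithm of Chapter 4. The social relationship model is defined in Chapter 4; here we assume, purely as a stand-in, that every member must have at least `min_friends` acquaintances inside the group, and that each user's region is an axis-aligned rectangle. The names `covers`, `socially_ok`, and `gskcg_query` are hypothetical.

```python
from itertools import combinations

def covers(region, point):
    """True if an axis-aligned rectangle (x1, y1, x2, y2) contains the point."""
    (x1, y1, x2, y2), (x, y) = region, point
    return x1 <= x <= x2 and y1 <= y <= y2

def socially_ok(group, friends, min_friends):
    """Assumed stand-in constraint: each member knows >= min_friends members."""
    members = set(group)
    return all(len(friends[u] & members) >= min_friends for u in group)

def gskcg_query(regions, friends, query_points, min_friends):
    """Return a smallest group whose regions jointly cover all query points."""
    for size in range(1, len(regions) + 1):  # smallest groups first
        for group in combinations(sorted(regions), size):
            jointly_cover = all(
                any(covers(regions[u], p) for u in group) for p in query_points
            )
            if jointly_cover and socially_ok(group, friends, min_friends):
                return group
    return None  # no feasible group exists

# Hypothetical toy data: four users, their regions, and their acquaintances.
regions = {"a": (0, 0, 2, 2), "b": (2, 0, 4, 2), "c": (0, 0, 4, 1), "d": (0, 1, 4, 3)}
friends = {"a": {"b"}, "b": {"a"}, "c": set(), "d": set()}
points = [(1, 1), (3, 1)]
with_social = gskcg_query(regions, friends, points, min_friends=1)
without_social = gskcg_query(regions, friends, points, min_friends=0)
print(with_social, without_social)
```

Without the social requirement a single user whose region contains both points suffices; once each member must know another member, the answer becomes a pair of acquainted users whose regions jointly cover the points. This interplay between spatial containment and social relations is what the query is designed to capture.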
GSKCG queries have applications in a wide range of location-based services. Some
of them are listed as follows:
1) Travel recommendation: To recommend a self-drive tour
of a few points of interest (POIs) (e.g., [60,61]), a GSKCG query helps to find a minimal
group of tourists who are collectively familiar with these POIs (e.g., in terms of weather,
accommodation safety, road conditions, and traffic laws) so as to reduce accident risks
and who have relatively tight social relations in order to make the tour more trustworthy and
more harmonious. The minimum group size makes it easier for all group members to
reach a consensus in subsequent planning.
2) Spatial task outsourcing: Given a set of
spatial tasks, each associated with a spatial location, one needs to distribute them to a set
of workers, each having a service region. To successfully accomplish the tasks, the service
regions of the selected workers should cover all spatial tasks’ locations, and the workers
are expected to have good collaborative relationships so that the tasks can be efficiently
performed. A GSKCG query directly addresses this worker selection problem in spatial
task outsourcing. In practice, the size of the group of selected workers should be minimum
to minimize employment cost.
3) Collaborative team organization: GSKCG queries are
useful for marketing and promotion agencies. For example, in an agency, each agent has
several familiar market areas and several good collaborators. If a company wants to hire
a marketing team to promote its products in some market areas, a GSKCG query finds
a good team that covers all promotion locations and that is cohesive while incurring the
minimum cost for the company. As another example, a community organization can resort
to a GSKCG query to find a minimal group of investigators to conduct a questionnaire
survey at several sites. The returned group will be jointly familiar with all the sites and
have a good collaborative atmosphere in order to efficiently deliver, collect and analyze
the questionnaires.
Compared with the SIG query problem, the GSKCG query problem is more challenging
because of the complex social relations introduced. We solve this problem in Chapter 4.
1.3 Social-aware Ridesharing Group Queries
The third part of this thesis is devoted to efficient social-aware ridesharing group
search. Nowadays, there is tremendous unused transportation capacity worldwide in the
form of unoccupied seats in private cars. Not only would filling some of these seats reduce
smog, carbon emissions, and fuel consumption, but it could also create opportunities for
increasing local social capital. Ridesharing is a natural and practical approach to making
use of these unoccupied seats and is envisioned as a promising solution to alleviating
transportation-related problems (e.g., traffic congestion, air pollution) in metropolitan
cities. As reported in a recent study [16], the potential traffic reduction in a city could be
as high as 31-59% if users are willing to share a ride with people whose travel patterns
are similar. Moreover, ridesharing can reduce travel expenses for both drivers and riders.
There have been some existing works on the ridesharing problem from both industry
and academia with a focus on the coordination of ridesharing trips and schedules. Given
a driver’s origin and destination, a ridesharing system returns to the driver a set of riders
by considering the trip and schedule similarity. Generally, current works can be categorized
into three types: i) static ridesharing [1, 4, 5, 44, 62, 66], which refers to the scenario
where the requests of drivers and riders are known in advance; ii) dynamic ridesharing
[24, 32, 50, 72], where riders and drivers continuously enter and leave the system and
are matched up in real time or at short notice; iii) trust-conscious ridesharing [1, 16],
which addresses the trust issue in ridesharing. Chapter 5 is concerned with trust-conscious
ridesharing. Existing approaches include the adoption of reputation-based systems and
profile checking by linking with social networks like Facebook. However, these attempts
cannot remove the major barriers in current ridesharing systems such as social discomfort
and safety concerns when traveling with strangers. Little work studies ridesharing by taking
social relations into consideration. Although [16] considers ridesharing with friends
or friends of friends, this kind of trust-conscious ridesharing is not very practical, as will
be shown in the social model analysis (elaborated in Chapter 5). Thus, these existing
solutions cannot be applied to the social-aware ridesharing problem considered in Chapter 5.
Figure 1.1: A framework of the social-aware ridesharing system (components: Riders, Drivers, Ridesharing Service Provider, Social Network; labeled flows: ride requests, ride offers, allocate drivers, allocate riders, social relations, engage)
In Chapter 5, we propose a new type of ridesharing query, called the Social-aware
Ridesharing Group (SaRG) query, which is based on trip matching and social
acquaintance. Broadly, as illustrated in Figure 1.1, our proposed ridesharing system consists of
three parties: (i) riders (or passengers who want to participate in ridesharing), (ii) drivers
(or private car owners who offer ridesharing), and (iii) the ridesharing service provider (RSP)
(the server in charge of arranging ridesharing). The riders submit ride requests
to the RSP, while the drivers send in ride offers. In other words, a ride offer provided
by a driver forms an SaRG query; the riders who submitted ride requests form the data
space (or search space); the RSP arranges the best ride matches by jointly
considering trip matching, social connections, and the capacity of the car. Designing
efficient matching algorithms for the RSP is the most important task in making the system
work effectively. Note that, in our problem, the RSP hosts a set of active ride requests
(expired requests might be dropped and re-submitted). Once a ride offer arrives from
a driver, the RSP matches the most suitable riders to the driver. A ridesharing group is
composed of a driver and the most suitable riders.
In our ridesharing system, we adopt a simple yet popular form of ridesharing called
Slugging [44]. Slugging assumes that the driver’s trip is fixed and that the riders
walk to the origin location of the driver’s trip, board at the departure time, alight at the
driver’s destination, and then walk to their own destinations. The idea of Slugging is
illustrated in Figure 1.2.
Figure 1.2: An example of Slugging (node labels v1, v2, v3)
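Under the Slugging model described above, a rider's extra effort consists of two walking legs: from the rider's origin to the driver's origin, and from the driver's destination to the rider's destination. The sketch below computes this walking cost. It assumes straight-line (Euclidean) distances and the hypothetical function name `slugging_walk_cost`; the travel cost actually used in this thesis is defined in Chapter 5.

```python
from math import dist

def slugging_walk_cost(rider_origin, rider_destination,
                       driver_origin, driver_destination):
    """Walking distance of one rider under Slugging (Euclidean, by assumption).

    The driver's trip is fixed; the rider walks to the driver's origin,
    rides along, alights at the driver's destination, and walks the rest.
    """
    walk_to_pickup = dist(rider_origin, driver_origin)
    walk_from_dropoff = dist(driver_destination, rider_destination)
    return walk_to_pickup + walk_from_dropoff

# Hypothetical example: the driver travels from (0, 0) to (10, 0).
walk = slugging_walk_cost((0, 3), (10, 4), (0, 0), (10, 0))
print(walk)
```

Because the driver's route does not change, the cost of adding a rider depends only on that rider's two walking legs, which keeps trip matching simple compared with models in which the driver makes detours.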
The consideration of social factors in ridesharing brings several new research
challenges. First, how to capture and model social constraints for the purpose of ridesharing
is a fundamental issue. Second, the social relationship may not be incremental in nature
(e.g., the acquaintance constraint among the users of a ridesharing group may not hold
after the removal of one user). As such, the social-aware ridesharing problem becomes
more challenging. Indeed, as we shall prove in Chapter 5, the SaRG query problem is
NP-hard, and therefore how to design an efficient algorithm to retrieve the optimal answer to
an SaRG query is the focus of Chapter 5. Our key insight is that in practical settings
an SaRG query possesses some intrinsic properties (e.g., the number of seats in a car is
usually small; the riders who are far away from the trip origin cannot be candidates for a
ridesharing group), which make the problem tractable.
1.4 Thesis Organization
The rest of this thesis is organized as follows. In Chapter 2, we present the related works
that are relevant to this thesis. In particular, we highlight the research works which are
closely related to our contributions in this thesis. We study the spatial-aware interest group
queries in location-based social networks in Chapter 3. In Chapter 4, we detail the geo-
social k-cover group queries for collaborative spatial computing. Chapter 5 presents the
social-aware ridesharing group queries. Finally, we summarize our contributions made in
this thesis and discuss the possible directions for the future work in Chapter 6.
Chapter 2
Related Works
In the first chapter, we have discussed the importance of group queries in location-based
social networks and proposed three important research problems. In this chapter, we
survey the existing works that are closely relevant to our proposed research problems.
2.1 Spatial query processing
Spatial query processing using R-tree and its variants has been extensively studied over
the past three decades. The existing works have studied various types of queries, including
k-nearest-neighbor queries [31, 34, 35, 49, 52], range queries [48, 59], and closest-pair
queries [22, 30, 57].
As a pioneering study on spatial query processing, Roussopoulos et al. [52] presented
an efficient branch-and-bound R-tree traversal algorithm to search for the nearest neighbor
object to a query point, and then extended it to the k-nearest-neighbor search. Meanwhile,
Katayama et al. [34] proposed a new index structure named SR-tree, which integrates
bounding spheres and bounding rectangles for high-dimensional nearest neighbor
queries. However, there are significant overlaps among the minimum bounding rectangles
(MBRs) in both the R-tree and the SR-tree, and these overlaps result in weak pruning
efficiency. To overcome this weakness, a novel index structure, the R*-tree, was presented
in [31] to reduce the overlapping MBRs. Based on the R*-tree index structure, Hjaltason
et al. [31] proposed an incremental algorithm to efficiently search the nearest neighbor.
Kolahdouzan and Shahabi [35] presented a novel approach using first-order Voronoi
diagrams to efficiently
evaluate k-nearest-neighbor queries in spatial network databases. Moreover, Papadias et
al. [49] extended the concept of the nearest neighbor query by considering a group of
points which aims to find a set of data points with the smallest sum of distances to all the
query points, and proposed various pruning heuristics to efficiently process such group
nearest-neighbor queries.
For the spatial range query processing, Tao et al. [59] studied the range search on mul-
tidimensional uncertain data. They presented a novel concept of “probabilistically con-
strained rectangle”, which supports effective pruning/validation of nonqualifying/qualifying
data. They also developed a new index structure called U-tree for minimizing the query
overhead. Pagel et al. [48] proposed a probabilistic model for user-defined window
queries, and characterized the efficiency of spatial data structures in terms of the expected
number of data bucket accesses needed to perform a window query.
The closest-pair query is another important query in spatial databases. Corral et al. [30]
presented non-incremental recursive and iterative branch-and-bound algorithms for
k-closest pair queries. Hjaltason and Samet [22] proposed an incremental algorithm
based on priority queues for distance join queries. Shin et al. [57] suggested adaptive
multistage and plane-sweep techniques for K-distance join queries and incremental distance
join queries.
Our work on SIG queries can be seen as extending the R-tree to handle queries with
mixed spatial and keyword information, which retrieves a set of users who satisfy the
mixed spatial and interest constraints.
2.2 Social query processing
There have been some studies on group and team queries over social networks with the
goal of finding a user group with a certain social relationship. Social groups or teams
are usually cohesive subgraphs formed by users with acquaintance relations. Their ac-
quaintance levels can be measured by several classical graph models, such as clique [29],
k-core [55], and k-plex [47]. The clique model idealizes cohesive properties, so cliques
seldom exist in real-life social networks and are difficult to compute. Both k-core and
k-plex are degree-based models. However, the k-plex problem is NP-complete since it
restricts the subgraph size, whereas k-core relaxes this restriction and can be computed
in time linear in the number of edges.
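
To make the k-core model concrete, the following minimal Python sketch computes the k-core of an undirected graph by repeatedly removing vertices whose remaining degree is below k. The adjacency-dictionary input format, the function name k_core, and the toy graph are illustrative assumptions and are not taken from the cited works.

```python
from collections import deque


def k_core(adj, k):
    """Return the vertex set of the k-core of an undirected graph.

    adj maps each vertex to the set of its neighbours (assumed input format).
    """
    degree = {v: len(ns) for v, ns in adj.items()}
    removed = set()
    # Start with every vertex that already violates the degree constraint.
    queue = deque(v for v, d in degree.items() if d < k)
    while queue:
        v = queue.popleft()
        if v in removed:
            continue
        removed.add(v)
        # Removing v lowers the degree of each surviving neighbour.
        for w in adj[v]:
            if w not in removed:
                degree[w] -= 1
                if degree[w] < k:
                    queue.append(w)
    return set(adj) - removed


# Toy graph: a triangle a-b-c with a pendant vertex d attached to c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
core2 = k_core(graph, 2)
```

For a fixed k, each vertex is removed at most once and each edge is inspected a constant number of times, which is consistent with the linear-time behaviour mentioned above.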
Group and team queries have been studied in the context of social networks [12, 27,
76], including social-temporal queries [67], and expert collaboration queries [37, 38]. In
detail, Yang et al. [67] proposed a social-temporal group query to find a group of activ-
ity attendees with the minimum total social distance to the query issuer. They proposed
two efficient algorithms, SGSelect and STGSelect, which include effective pruning tech-
niques and employ the idea of pivot time slots to substantially reduce the query processing
time. Lappas et al. [37] and Li et al. [38] studied the problem of expert team formation,
which aims to find a group of experts who cover all required skills while minimizing the
communication cost among them.
In this thesis, we use k-core to model users’ social relations, which is different from
the previous studies. In addition, our proposed queries, GSKCG and SaRG, take into
consideration the spatial factor.
2.3 Spatial keyword query processing
Recently, spatial queries have been extended to incorporate text keywords, known as spa-
tial keyword queries in the literature. Zhou et al. [73] proposed a hybrid index structure
to handle both textual and spatial queries. They studied the performance of hybrid index
structures that integrate text indexes and spatial indexes for location-based web search.
This work opens a stream of research topics on spatial keyword search. Cong et al. [18]
presented a new indexing scheme called IR-tree, which integrates the R-tree and inverted
files for location-aware top-k object retrieval. Rocha et al. [51] proposed top-k
spatial keyword queries on road networks, where the distance between the query location
and the spatial object is the length of the shortest path. An efficient method based on a
new hybrid index, the cell-keyword conscious B+-tree, was proposed by Cong et al. [19] to
process top-k queries on trajectory databases. However, these works all assume a static
query location at a snapshot. They cannot keep a mobile user continuously aware of the k
spatial web objects that best match a query with respect to location and text relevancy. Based
on these practical query requirements, Wu et al. [64] studied the efficient processing of
continuously moving top-k spatial keyword (MkSK) queries. They proposed two effi-
cient methods for computing safe zones that guarantee correct results at any time with the
minimum communication cost.
The difference between the top-k query and the k-nearest-neighbor query lies in whether
the keywords in the query are used as a soft or hard constraint [42]. Felipe et al. [23]
considered how to find the k nearest neighbors of the query location, with each object in
the result containing the set of keywords issued in the query. Lu et al. [43] proposed a
hybrid index tree called IUR-tree to efficiently process reverse spatial textual
k-nearest-neighbor queries, which find the objects that take the query object as one of
their k most spatial-textual similar objects.
Different from the works on top-k and k-nearest-neighbor queries presented above,
Fan et al. [26] studied the problem of spatio-textual range queries on a new kind of spatio-
textual data named regions-of-interest (ROIs). They developed textual-based and grid-
based filtering algorithms to efficiently find a set of objects that have large overlap with
the query region and high textual similarity. It is the first work that considers the queries
on the spatial object with the spatial region and textual properties.
However, in some cases, a single spatial object may not cover all the query keywords, which
may lead to empty results. Thus, several works proposed the aggregate spatial keyword
search to tackle this problem. It returns a set of spatial objects that collaboratively cover
all the query keywords. Zhang et al. [70] studied an m-closest-keyword (mCK) query
that finds a set of spatially closest objects covering m specified keywords. Cao et al. [10]
proposed a collective spatial keyword query that retrieves a group of nearby spatial objects
to collectively cover the specified keywords. The techniques they presented can solve the
presented problem, but the problem of collaborative spatial keyword search is usually
NP-hard or NP-complete, which results in inferior query processing performance. Compared
with the above two works, Long et al. [42] presented a more efficient and exact algorithm
to tackle such problems by adopting a distance owner-driven approach.
It is noteworthy that, unlike these previous studies, the proposed SIG query in this the-
sis explores the relationship between users’ locations and interests in the query keywords
and searches the k-size maximum interest group on location-based applications.
2.4 Geo-Social query processing
Efficiently processing queries that consider both spatial and social constraints has
attracted increasing attention recently. One main stream of work mines users' location and
social network data to find the relationships between the users and their locations.
Studies [13, 54] have shown that users with short social distances usually live
geographically close.
Yet query processing research in this direction is still in its infancy. Liu et al. [41]
proposed the circle-of-friend query to find minimal-diameter social groups. Shi et al. [56]
presented a model by considering both spatial information and the social relationships
between users who visit the clustered places. They extended the density-based clustering
paradigm and applied it on places which are visited by users of a geo-social network.
Armenatzoglou et al. [3] proposed a general framework that offers flexible data manage-
ment and algorithmic design for Geo-Social Network (GeoSN) queries. Their architecture
segregates the social, geographical and query processing modules. Each GeoSN query is
processed via a transparent combination of primitive queries issued to the social and ge-
ographical modules. Yang et al. [68] proposed a socio-spatial group query to select a
group of nearby attendees with a tight social relationship. They designed a new index
structure called Social R-tree to integrate the users’ social relationships into an R-tree for
efficient query processing. This index is different from our proposed Enhanced SaR-tree
in Chapter 4 in that it is used to reduce the checking states during the enumeration. Zhu
et al. [39] presented a new family of geo-social group queries with minimum acquain-
tance constraint (GSGQs), and also designed a new index structure named SaR-tree to
accelerate the GSGQs queries. However, the SaR-tree cannot be directly adopted by our
GSKCG queries due to our regional spatial factor which differs from the point spatial
factor in [39].
Unlike the studies [41, 68] that aim to minimize the spatial distance among selected
users, our GSKCG query aims to find a group of users whose associated regions jointly
cover all query points, a brand new spatial constraint with important real-life applications.
Moreover, we use a different model k-core to measure the level of social acquaintance, a
more reasonable measure for practical use.
2.5 Ridesharing query processing
We survey the ridesharing query processing techniques from the following three aspects:
static ridesharing, dynamic ridesharing, and trusted ridesharing problems.
Most of the early studies considered static ridesharing, which refers to the ridesharing
where the requests of drivers and riders are known in advance. We classify the stat-
ic ridesharing in the following categories: slugging, carpooling, and dial-a-ride. Slug-
ging [5] is one particular form of ridesharing where passengers walk to the origin of the
driver’s trip, board at the departure time, debark at the driver’s destination and then walk
to their own destinations. Ma and Wolfson [44] studied slugging from a computational
perspective using a graph abstraction. Carpooling is another representative application
of ridesharing for daily commutes, where private car drivers declare their availability for
pick-up and later bringing back riders. The main issue in carpooling is about the as-
signment of riders to drivers and the identification of each driver’s route to minimize
the travel cost. For small-size carpooling, it can be solved by using linear programming
techniques [7, 9]. To deal with large-size problem, many heuristic algorithms have been
proposed [1, 62]. More recently, Yan and Chen [66] employed a time-space network
flow technique to develop a model for the many-to-many carpooling system with mul-
tiple vehicle and person types. They developed a solution algorithm based on Lagrangian
relaxation. In the dial-a-ride problem (DARP), no private car is involved and the
transportation is carried out by public vehicles (such as taxis) that provide a shared service.
Users formulate requests by specifying the origin and destination locations. The aim
is to design a minimum-cost set of vehicle routes accommodating all requests under a
number of spatial-temporal constraints. Earlier works on DARP can be found in a
survey [21]. DARP is NP-hard in general. Only problems that involve a small number of
vehicles and ride requests can be solved exactly, often by integer
programming techniques [20]. For large-scale DARP, heuristics are still the most popular
methods [4, 21, 65]. These approaches usually have two phases: the first obtains an
initial schedule, and the second improves the solution by local search.
Enabled by recent mobile technologies, dynamic ridesharing services have been gain-
ing increasing attention (e.g., [24, 32, 72]). In dynamic ridesharing systems, riders and
drivers continuously enter and leave the system; dynamic ridesharing algorithms match
them up in real time on short notice. Existing works can be broadly classified into two
categories: centralized and distributed. Centralized real-time ridesharing relies on a cen-
tral service provider to perform all operations for ridesharing. A recent survey on the
optimization techniques for centralized dynamic ridesharing can be found in [2]. Vari-
ous optimization objectives (e.g., minimizing system-wide vehicle miles or travel time)
and spatial-temporal constraints (with desired departure/arrival time or spatial proxim-
ity requirements) have been considered. [50] proposed an opportunistic user interface
to support centralized rideshare planning whilst preserving location privacy. [32] is the
latest work that modeled a centralized real-time ridesharing problem with service guar-
antee. They proposed several novel kinetic tree-based algorithms that are better suited to
dynamic request scheduling and on-the-fly route adjustment. The drawback of the cen-
tralized ridesharing is its lack of scalability, especially when ridesharing requests are in a
large volume. To address this issue, distributed ridesharing solutions have been develope-
d (e.g., [24, 72]). [24] proposed a dynamic taxi-sharing algorithm based on peer-to-peer
communications and distributed coordination. [72] presented a distributed ridesharing ser-
vice based on a new geometry matching algorithm to shorten the waiting time for passen-
gers and to avoid traffic jams. However, all these works considered only the participants’
itineraries and time schedule constraints in rideshare assignment. They cannot be ap-
plied to the social-aware rideshare assignment problem, which imposes complex social
constraints.
A few recent works have attempted to address the trust issue in ridesharing
[1, 16]. Suggested approaches include the adoption of reputation-based systems and
profile checking by linking with social networks like Facebook [1]. Both of these ap-
proaches entail significant involvement from participants. In [16], Cici et al. suggested
grouping participants who are friends or friends of friends in the assessment of the poten-
tial benefits of ridesharing. However, as indicated by our user study result, such simple
social constraints can be either too restricted or too relaxed to be practical for realistic
ridesharing systems.
Chapter 3
Spatial-aware Interest Group Queries
in Location-based Social Networks
Currently, most of the location-based social networks release check-in services that al-
low users to share their visiting locations with friends. These checked-in spatial objects
usually reflect the users’ interests. Moreover, the interests and locations of users are es-
sential for activity planning and product promotion. Based on such data, in this chapter,
we study the spatial-aware interest group queries in location-based social networks. The
rest of this chapter is organized as follows. Section 3.1 presents the problem definition.
Section 3.2 presents two efficient algorithms based on IR-tree for the processing of SIG
queries. Section 3.3 shows the empirical study of our proposed algorithms on two real
datasets. Section 3.4 summarizes this chapter.
3.1 Problem Definition
In this section, we give some preliminaries and provide the problem statement, followed
by an example to elaborate the problem defined. Table 3.1 summarizes the notations used
throughout this chapter.
Table 3.1: Summary of notations
Notation                          Definition
D                                 a set of spatial objects
U                                 a set of users
I(u, T)                           the interest of user u on the tag set T
$G_k$                             a group of size k
$I(G_k, q.T)$                     the group interest on the query keywords q.T
$D(G_k)$                          the diameter of a group $G_k$
$G_k(u_i)$                        a group including $u_i$
$\mathcal{G}_k(u_i)$              a group set in which all the groups include $u_i$
$\mathrm{rank}^u_q(G_k(u_i))$     the ranking upper bound of group $G_k(u_i)$
$C(u_i, \|u_i, u_j\|)$            a circle centered at $u_i$ with radius $\|u_i, u_j\|$

Let D be a set of spatial objects. Each spatial object p is associated with a set of
tags p.Γ. Let U be a set of users. Each user u ∈ U is a triple (id, λ, ν), where id is
the user's identifier, λ is the user's location, and ν is a vector of the user's interests for
the tags that are associated with the spatial objects checked in by the user. The interest
value for a set of tags is defined in Definition 3.1. We may use interest and interest value
interchangeably.
Definition 3.1. Let Du be the set of spatial objects checked in by user u and let Dt be
the set of spatial objects that are associated with tag t. Function Count(u, p) counts the
number of times spatial object p is checked in by user u. User u's interest value for tag t
is computed as:¹

$$I(u, t) = \frac{\sum_{p \in D_u \wedge p \in D_t} \mathrm{Count}(u, p)}{\sum_{p \in D_u} \mathrm{Count}(u, p)}. \tag{3.1.1}$$

Given a set of tags T, if a user's interest value for every tag t ∈ T is positive, we say the
user fully covers T. Specifically, the interest value of a user u for a tag set T is defined
as:

$$I(u, T) = \sum_{t \in T} I(u, t). \tag{3.1.2}$$
As an example, Table 3.2 shows an interest vector. Higher values indicate higher
interest. For example, the user is more interested in ‘hotel’ than ‘sport’. The user’s
interest value for tags ‘hotel’ and ‘sport’ is 0.36+0.2=0.56.
Table 3.2: Example Interest Vector
Tag             movie  airport  hotel  music  sport
Interest Value  0.10   0.20     0.36   0.14   0.20
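
The interest computation of Equations 3.1.1 and 3.1.2 can be sketched in a few lines of Python. The check-in counts, object names, and tag assignments below are hypothetical; they are chosen only so that the resulting interest vector matches the values of Table 3.2.

```python
def interest(checkins, tags_of, tag):
    """I(u, t): share of the user's check-ins made at objects tagged `tag`."""
    total = sum(checkins.values())
    if total == 0:
        return 0.0
    hits = sum(count for obj, count in checkins.items() if tag in tags_of[obj])
    return hits / total


def interest_in_set(checkins, tags_of, tags):
    """I(u, T): sum of the per-tag interest values over the tag set."""
    return sum(interest(checkins, tags_of, t) for t in tags)


# Hypothetical check-in counts of one user (object -> number of check-ins).
checkins = {"cinema": 5, "terminal": 10, "inn": 18, "concert_hall": 7, "stadium": 10}
# Hypothetical tag sets of the spatial objects.
tags_of = {
    "cinema": {"movie"},
    "terminal": {"airport"},
    "inn": {"hotel"},
    "concert_hall": {"music"},
    "stadium": {"sport"},
}

hotel_and_sport = interest_in_set(checkins, tags_of, ["hotel", "sport"])
```

With these counts the user has 50 check-ins in total, so the interest in 'hotel' is 18/50 and the interest in 'sport' is 10/50, which agrees with the example above up to floating-point rounding.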
¹ The definition of user interest based on check-in counts is adopted due to its simplicity.
We can also use other models, such as user ratings or likes/dislikes, to quantify the user
interest, and no modification of the query processing algorithm is needed.

We follow existing works [11, 46, 63] to define a ranking function as a weighted sum
of the normalized group interest and group diameter.² It ranks a user group Gk of size k
with regard to a query q, denoted by rankq(Gk):

$$\mathrm{rank}_q(G_k) = \alpha \frac{I(G_k, q.T)}{I_{\max}(q.T)} + (1 - \alpha)\left(1 - \frac{D(G_k)}{D_{\max}}\right). \tag{3.1.3}$$
Here q.T is a set of query keywords that belong to the tag space of the dataset, I(Gk, q.T)
is the group interest on the query keywords q.T, which is defined as the minimum interest
of the users in the group if Gk fully covers q.T, i.e.,

$$I(G_k, q.T) = \begin{cases} \min\{I(u, q.T) \mid u \in G_k\} & \text{if the users in } G_k \text{ jointly fully cover } q.T \\ 0 & \text{otherwise} \end{cases} \tag{3.1.4}$$

and D(Gk) is the diameter of group Gk, i.e., the Euclidean distance between the farthest
pair of users in the group,

$$D(G_k) = \max\{\|u_i.\lambda, u_j.\lambda\| \mid u_i, u_j \in G_k\}, \tag{3.1.5}$$

where ||ui.λ, uj.λ|| is the Euclidean distance between two users. Parameter α ∈ [0, 1] is
used to balance the group interest and the group diameter.
Definition 3.2. A Spatial-aware Interest Group (SIG) query q = (T, k) consists of two
parameters: a set of keywords q.T and the size k of the requested user group. It re-
trieves a user group of size k where each user is interested in the query keywords and the
users are close to each other in the Euclidean space, meaning that the ranking function
(Equation 3.1.3) is maximized.
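
As a reference point for the definition, the sketch below answers an SIG query by exhaustively ranking every group of size k with Equation 3.1.3. It is meant only for tiny inputs, since the number of groups grows combinatorially. The dictionary-based user records are an assumed representation, and the normalizers are assumptions as well: Imax(q.T) is taken as the largest single-user interest in q.T and Dmax as the largest pairwise distance in the dataset.

```python
from itertools import combinations
from math import dist


def user_interest(user, tags):
    """I(u, T) from a precomputed interest vector."""
    return sum(user["interest"].get(t, 0.0) for t in tags)


def group_interest(group, tags):
    """Minimum member interest if the group jointly covers all tags, else 0."""
    covered = all(any(u["interest"].get(t, 0.0) > 0 for u in group) for t in tags)
    return min(user_interest(u, tags) for u in group) if covered else 0.0


def diameter(group):
    """Largest pairwise Euclidean distance among the group members."""
    return max((dist(a["loc"], b["loc"]) for a, b in combinations(group, 2)), default=0.0)


def sig_query_bruteforce(users, tags, k, alpha):
    i_max = max(user_interest(u, tags) for u in users)  # assumed normalizer
    d_max = diameter(users)                              # assumed normalizer

    def rank(group):
        gi = group_interest(group, tags) / i_max if i_max > 0 else 0.0
        gd = diameter(group) / d_max if d_max > 0 else 0.0
        return alpha * gi + (1 - alpha) * (1 - gd)

    return max(combinations(users, k), key=rank)


# Hypothetical users: identifier, location, and interest vector.
users = [
    {"id": "u1", "loc": (0.0, 0.0), "interest": {"coffee": 0.9}},
    {"id": "u2", "loc": (10.0, 0.0), "interest": {"coffee": 0.8}},
    {"id": "u3", "loc": (0.0, 1.0), "interest": {"coffee": 0.1}},
    {"id": "u4", "loc": (10.0, 10.0), "interest": {"coffee": 0.2}},
]
by_interest = sig_query_bruteforce(users, ["coffee"], 2, alpha=1.0)
by_distance = sig_query_bruteforce(users, ["coffee"], 2, alpha=0.0)
```

With α = 1 the two most interested users are selected regardless of distance, whereas with α = 0 the closest pair is selected regardless of interest, which mirrors the role of α described above.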
Example 3.1. Figure 3.1 illustrates an example SIG query with three different values of α
in the ranking function (Equation 3.1.3). The circles, squares, and triangles in the figure
depict the locations of a set of users. Given an SIG query q, the sizes of those shapes
indicate the user interests for a set of query keywords q.T. The bigger the size, the higher
the user interest. Query q requests a user group of size 4 that maximizes the ranking
function. The gray circles are the result group when α = 0, i.e., only the group diameter
is considered. The gray squares are the result group when α = 0.5. The gray triangles
represent the query result when α = 1, i.e., only the group interest is considered.

Figure 3.1: An example of SIG query (panels: α = 0, α = 0.5, α = 1)

² Note that the group interest and diameter factors are not directly comparable. In order
to treat these two factors fairly, we use the global maximum group diameter Dmax and
maximum group interest Imax(q.T) to normalize them so that they will be kept in the same
value domain [0, 1].
Theorem 3.1. The SIG query problem is NP-complete.
Proof. We establish the hardness by a reduction from a classical NP-complete problem,
namely the minimum set cover problem (MSC). An instance of the MSC problem consists
of a universe set U = {e1, e2, . . . , en}, a collection of sets S = {S1, S2, . . . , Sm}, where
Si is a subset of U and an integer k. The decision problem of MSC is to find whether
there is a subset S ′ ⊆ S, such that |S ′| ≤ k and the union of S ′ fully covers U .
Given an instance of MSC, we construct an instance of SIG query q = (T, k) on a set
of users. Each element ei in U is a keyword ti in q.T , each set Si is a user ui, and the
elements in Si are ui’s interests (keywords). We set the value α of the SIG query to 1. We
remark that Imax(q.T ) is a constant under this setting. Thus, maximizing rankq(Gk) is
equivalent to maximizing I(Gk, q.T ).
Suppose that we have a PTIME algorithm A that returns the query answer Gk =
{u′1, u′2, . . . , u′k} of an SIG query. There are two cases. Case 1: If q.T is fully covered
by the interests of Gk, then {S′1, S′2, . . . , S′k} fully covers U and its size is k. Therefore,
a solution of the MSC is found. Case 2: If Gk does not fully cover q.T, then there does
not exist another group G′k such that the interests of G′k fully cover q.T. Otherwise, G′k
would be returned as an answer as in Case 1. Therefore, with such a Gk, one can conclude
that the MSC instance does not have a solution. By using A, the MSC problem would be
solved in PTIME, which is impossible unless P = NP. Therefore, there does not exist a
PTIME algorithm A that can solve the SIG query problem unless P = NP.
3.2 Proposed Approaches
In this section, we present two efficient algorithms, namely Interest Oriented Algorithm
(IOAIR) and Diameter Oriented Algorithm (DOAIR), for the processing of SIG queries
based on the IR-tree [18]. Section 3.2.1 introduces the index structure IR-tree. Sec-
tion 3.2.2 presents the basic ideas of the two algorithms. Sections 3.2.3 and 3.2.4 elaborate
the two algorithms.
3.2.1 Preliminary: IR-Tree
We adopt the IR-tree index structure [18], where users are considered as spatial objects,
and users' locations and interest vectors are considered as the locations and documents of
objects, respectively. Figure 3.2(a) and Figure 3.2(b) show the users' locations and the
corresponding IR-tree. The IR-tree is essentially an R-tree augmented with inverted files. The
leaf nodes in the IR-tree contain a number of entries of the form (u, u.λ), where u refers
to a user and u.λ is the MBR (minimum bounding rectangle) of the user’s location. Each
leaf node also includes a pointer to an inverted file that indexes the interest vectors of the
users stored in the node.
Each non-leaf node in the IR-tree includes several entries in the form of (ch,mbr),
where ch is the identifier of a child node and mbr is the MBR of all rectangles in the
child nodes. Each non-leaf node also includes a pointer to an inverted file that indexes the
pseudo interest vectors of the entries stored in the node. The pseudo interest vector of an
entry contains all the tags that appear in its child nodes. The interest value for each tag is
the maximum value in the child nodes.
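
The construction of a pseudo interest vector can be sketched as follows. The function names pseudo_interest_vector and can_prune, the dictionary representation, and the sample values are illustrative assumptions; the second function shows one way the per-tag maxima could serve as an upper bound on the interest of any user stored below an entry.

```python
def pseudo_interest_vector(child_vectors):
    """Merge child interest vectors, keeping the maximum value for each tag."""
    merged = {}
    for vec in child_vectors:
        for tag, value in vec.items():
            merged[tag] = max(merged.get(tag, 0.0), value)
    return merged


def can_prune(entry_vector, tags, threshold):
    """True if no user below the entry can reach interest `threshold` on `tags`.

    The sum of per-tag maxima is an upper bound on I(u, T) for every user u
    in the subtree, because each I(u, t) is at most the stored maximum.
    """
    bound = sum(entry_vector.get(t, 0.0) for t in tags)
    return bound < threshold


# Two hypothetical users stored in the same leaf node.
leaf_users = [
    {"coffee": 0.4, "tea": 0.5},
    {"coffee": 0.6},
]
node_vector = pseudo_interest_vector(leaf_users)
```

Because the bound is a sum of maxima that may come from different users, it can overestimate the best interest actually present in the subtree; it never underestimates it, which is what safe pruning requires.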
We remark that the user locations (e.g., when they refer to home/office addresses)
may not be frequently updated. When the user location changes, the IR-tree should
be updated accordingly to support efficient query processing. Fortunately, this can be
well handled by the embedded updating mechanism of IR-tree, whose efficiency has been
demonstrated in [18].

Figure 3.2: Tree Index Structure. (a) R-tree; (b) IR-tree
3.2.2 Overview
To avoid enumerating all possible groups, algorithms IOAIR and DOAIR construct groups
in special orders. If the ranking score of the current found group is higher than the upper
bounds on the ranking scores of the unseen groups, the current found group is returned as
the result. The derivation of the upper bound on the ranking score of a group is shown in
Theorem 3.2.
Theorem 3.2. Let Dmin be the distance between the closest pair of users in the dataset.
An upper bound on the ranking score of a group Gk(ui) of size k containing user ui is

$$\mathrm{rank}^u_q(G_k(u_i)) = \alpha \frac{I(u_i, q.T)}{I_{\max}(q.T)} + (1 - \alpha)\left(1 - \frac{D_{\min}}{D_{\max}}\right). \tag{3.2.6}$$

Proof. According to Equation 3.1.4, we have I(Gk(ui), q.T) ≤ I(ui, q.T). And since
Dmin ≤ D(Gk(ui)), we derive

$$\begin{aligned}
\mathrm{rank}^u_q(G_k(u_i)) &= \alpha \frac{I(u_i, q.T)}{I_{\max}(q.T)} + (1 - \alpha)\left(1 - \frac{D_{\min}}{D_{\max}}\right) \\
&\ge \alpha \frac{I(G_k(u_i), q.T)}{I_{\max}(q.T)} + (1 - \alpha)\left(1 - \frac{D(G_k(u_i))}{D_{\max}}\right) \\
&= \mathrm{rank}_q(G_k(u_i)).
\end{aligned}$$
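
A small numeric check may help to see why the bound holds. The two functions below simply transcribe Equations 3.1.3 and 3.2.6; the parameter names and the sample numbers are hypothetical.

```python
def rank(group_interest, group_diameter, i_max, d_max, alpha):
    """Ranking score of a group (Equation 3.1.3)."""
    return alpha * group_interest / i_max + (1 - alpha) * (1 - group_diameter / d_max)


def rank_upper_bound(user_interest, i_max, d_min, d_max, alpha):
    """Upper bound for any group containing the given user (Equation 3.2.6)."""
    return alpha * user_interest / i_max + (1 - alpha) * (1 - d_min / d_max)


# Hypothetical setting: a user with interest 0.8 belongs to a group whose
# minimum member interest is 0.6 and whose diameter is 4.
alpha, i_max, d_min, d_max = 0.5, 1.0, 1.0, 10.0
bound = rank_upper_bound(0.8, i_max, d_min, d_max, alpha)
score = rank(0.6, 4.0, i_max, d_max, alpha)
```

The group interest cannot exceed the interest of any member, and the group diameter cannot be smaller than Dmin, so both terms of the bound dominate the corresponding terms of the score.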
3.2.3 Interest Oriented Algorithm
Interest Oriented Algorithm (IOAIR) classifies groups in terms of users' interests. Let
set $\mathcal{G}_k$ contain all possible user groups of size k. Let set $\mathcal{G}_k(u_i)$
cover all the groups of size k that contain user ui and have the same level of interest as
ui, i.e., $\forall G \in \mathcal{G}_k(u_i)\,(I(G, q.T) = I(u_i, q.T))$. Obviously,
$\bigcup_{u_i \in U} \mathcal{G}_k(u_i) = \mathcal{G}_k$. Algorithm IOAIR follows
the descending order of the user interest and iteratively constructs the group Gk(ui)
with the maximum ranking score in $\mathcal{G}_k(u_i)$. If the ranking score of the
currently constructed group Gk(ui) is higher than the upper bound on the ranking score of
the next group Gk(ui+1) (termination condition), the currently found group is returned as
the result. The correctness of the termination condition is guaranteed by Lemma 3.1. The
correctness of algorithm IOAIR is guaranteed by Theorem 3.3.
Lemma 3.1. Let S = {u1, u2, · · · , un} be a sorted list of users in descending order of
their interests. If $\mathrm{rank}_q(G_k(u_i)) > \mathrm{rank}^u_q(G_k(u_j))$, we have
$\mathrm{rank}_q(G_k(u_i)) > \mathrm{rank}_q(G_k(u_m))$, where k ≤ i < j ≤ m ≤ n.
Proof. For j ≤ m ≤ n, we have I(uj, q.T) ≥ I(um, q.T). According to Equation 3.2.6,
we derive $\mathrm{rank}^u_q(G_k(u_j)) \ge \mathrm{rank}^u_q(G_k(u_m))$. Hence, we get
$\mathrm{rank}_q(G_k(u_i)) > \mathrm{rank}^u_q(G_k(u_j)) \ge \mathrm{rank}^u_q(G_k(u_m)) \ge \mathrm{rank}_q(G_k(u_m))$
based on Theorem 3.2.
Theorem 3.3. Algorithm IOAIR finds the correct answer to an SIG query.
Proof. We prove it by contradiction. Assume that given an SIG query q, algorithm IOAIR
returns G as the result. Now suppose there exists G′ with the maximum ranking score such
that $\mathrm{rank}_q(G) < \mathrm{rank}_q(G')$. Let ui be the user with the minimum
interest in G and u′i be the user with the minimum interest in G′. Hence, G and G′ are the
groups with the maximum ranking score in $\mathcal{G}_k(u_i)$ and $\mathcal{G}_k(u'_i)$,
respectively. There are three possible cases.
(1) If I(ui, q.T) < I(u′i, q.T), algorithm IOAIR first considers $\mathcal{G}_k(u'_i)$,
and then $\mathcal{G}_k(u_i)$. According to Lemma 3.1, we have
$\mathrm{rank}_q(G') \ge \mathrm{rank}^u_q(G) \ge \mathrm{rank}_q(G)$. Thus, algorithm
IOAIR must return the group G′ in $\mathcal{G}_k(u'_i)$, not G in $\mathcal{G}_k(u_i)$.
(2) If I(ui, q.T) = I(u′i, q.T), we have $\mathcal{G}_k(u'_i) = \mathcal{G}_k(u_i)$.
Algorithm IOAIR must return the group G′ since $\mathrm{rank}_q(G) < \mathrm{rank}_q(G')$.
(3) If I(ui, q.T) > I(u′i, q.T), algorithm IOAIR first considers $\mathcal{G}_k(u_i)$, and
then $\mathcal{G}_k(u'_i)$. According to Lemma 3.1, we have
$\mathrm{rank}_q(G) \ge \mathrm{rank}^u_q(G') \ge \mathrm{rank}_q(G')$, which contradicts
the assumption that $\mathrm{rank}_q(G) < \mathrm{rank}_q(G')$. Hence, the correctness of
algorithm IOAIR is proved.
Algorithm 1 shows the pseudo code of the IOAIR algorithm. The candidate group
Gk is initialized as the k-sized user group with the minimum diameter (line 1). The
algorithm processes users in descending order of their interests for the query keywords (line 3) by
calling function GetNextUser that adopts the Threshold Algorithm [75]. For the current
obtained user ui, function IOAIRGetNextGroup constructs a group of size k containing
ui with the maximum ranking score (i.e., minimum diameter), denoted as Gk(ui), where
I(Gk(ui), q.T ) = I(ui, q.T ) (line 5). The constructed group Gk(ui) is assigned as the
candidate group if its ranking score is higher than that of the candidate group Gk (lines 6
and 7). If the ranking score of the candidate group is higher than the upper bound on the
ranking score of Gk(ui+1) that is the group of size k containing the next user ui+1 with
the maximum ranking score, the algorithm returns the candidate group as the result and
terminates (lines 8 and 9).
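
The control flow described above can be illustrated with the following simplified Python sketch. It is not the thesis's implementation: it replaces GetNextUser by an in-memory sort, replaces IOAIRGetNextGroup by exhaustive enumeration, and omits the distance threshold Dc. The dictionary-based user records and the choice of Imax and Dmax are assumptions. The sketch further assumes k ≥ 2, at least two distinct user locations, and at least one user with positive interest.

```python
from itertools import combinations
from math import dist


def ioa_sketch(users, tags, k, alpha):
    def u_int(u):
        return sum(u["interest"].get(t, 0.0) for t in tags)

    def diam(g):
        return max(dist(a["loc"], b["loc"]) for a, b in combinations(g, 2))

    def g_int(g):
        covered = all(any(u["interest"].get(t, 0.0) > 0 for u in g) for t in tags)
        return min(u_int(u) for u in g) if covered else 0.0

    pair_d = [dist(a["loc"], b["loc"]) for a, b in combinations(users, 2)]
    d_min, d_max = min(pair_d), max(pair_d)
    i_max = max(u_int(u) for u in users)  # assumed normalizer

    def rank(g):
        return alpha * g_int(g) / i_max + (1 - alpha) * (1 - diam(g) / d_max)

    def upper(u):  # Equation 3.2.6
        return alpha * u_int(u) / i_max + (1 - alpha) * (1 - d_min / d_max)

    # Line 1: start from the group with the minimum diameter.
    best = min(combinations(users, k), key=diam)
    order = sorted(users, key=u_int, reverse=True)
    for i, u in enumerate(order):
        # Groups in which u has the lowest interest: partners come from
        # the users that precede u in the descending order.
        for others in combinations(order[:i], k - 1):
            g = (u,) + others
            if rank(g) > rank(best):
                best = g
        # Termination: no group containing a later user can beat `best`.
        if i + 1 < len(order) and rank(best) > upper(order[i + 1]):
            break
    return best


users = [
    {"id": "u1", "loc": (0.0, 0.0), "interest": {"coffee": 0.9}},
    {"id": "u2", "loc": (10.0, 0.0), "interest": {"coffee": 0.8}},
    {"id": "u3", "loc": (0.0, 1.0), "interest": {"coffee": 0.1}},
    {"id": "u4", "loc": (10.0, 10.0), "interest": {"coffee": 0.2}},
]
result_interest = ioa_sketch(users, ["coffee"], 2, alpha=1.0)
result_distance = ioa_sketch(users, ["coffee"], 2, alpha=0.0)
```

In this toy dataset with α = 1, the loop stops after the second user, because the score of the pair formed by the two most interested users already exceeds the upper bound of every remaining user.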
Algorithm 1 IOAIR(Integer k, Keywords T, InvertedFile invf, IRTree irtree)
 1: Result Gk ← the k-sized user group with the minimum diameter;
 2: Dc ← ∞;
 3: while ui, ui+1 ← GetNextUser(T, invf) do
 4:     Update Dc according to Equation 3.2.7;
 5:     Gk(ui) ← IOAIRGetNextGroup(irtree, ui, k, Dc, T);
 6:     if rankq(Gk(ui)) > rankq(Gk) then
 7:         Gk ← Gk(ui);
 8:     if rankq(Gk) > rankuq(Gk(ui+1)) then
 9:         Return Gk;
10: Return Gk;

In order to find group Gk(ui) such that I(Gk(ui), q.T) = I(ui, q.T) with the maximum
ranking score in $\mathcal{G}_k(u_i)$, function IOAIRGetNextGroup (Algorithm 2) uses the
IR-tree to retrieve the users who have higher interest values than ui and puts them in G′k.
Taking advantage of the IR-tree, where each entry in each node has an upper bound on
the users' interests contained in the subtree pointed to by the entry, it is able to prune the
nodes whose interest is smaller than the interest of user ui, since no user in such a subtree
can have a larger interest than user ui (line 17). Since the interest of group Gk(ui)
is determined, constructing Gk(ui) with the maximum ranking score is equivalent to finding
a group of size k with the minimum diameter from G′k (line 14). We apply a backtracking
method to enumerate all possible groups of size k; each group also needs to be checked to
see whether it fully covers T.
Early Stop Let G′k contain all the users with higher interests than ui. It is possible
to prune some users in G′k so that Gk(ui) can be quickly found from G′k. Function
IOAIRGetNextGroup considers the users with higher interests than ui in ascending or-
der of their distances to ui. If k − 1 users have been obtained, a candidate group of size
k including ui is formed. If the diameter of the candidate group is not greater than the
distance between ui and the newly added user (line 7), the candidate group is the one
with the maximum ranking score. Otherwise, the candidate group is updated by consid-
ering the newly added user. The correctness is guaranteed by Theorem 3.4 (illustrated by
Example 3.2).
Algorithm 2 IOAIRGetNextGroup(IRTree irtree, User ui, Integer k, Double Dc, Keywords T)
 1: Queue ← NewPriorityQueue();
 2: Queue.Enqueue(irtree.root, 0);
 3: Add ui to G′k;
 4: while Queue is not empty do
 5:     Entry e ← Queue.Dequeue();
 6:     if e refers to a user then
 7:         if D(Gk) ≤ ||ui e|| then
 8:             if D(Gk) < Dc then
 9:                 Return Gk;
10:             else
11:                 Return NULL;
12:         Add e to G′k;
13:         if G′k contains more than k users then
14:             Gk ← select the group of size k with the minimum diameter from G′k;
15:     else
16:         for each entry e′ in the node pointed to by e do
17:             if the interest of e′ > the interest of ui then
18:                 if ||ui e′|| < Dc then
19:                     Queue.Enqueue(e′, ||ui e′||);
20: Return Gk;

Theorem 3.4. Let S = {u1, u2, · · · , um, um+1, · · · , un} be a sorted list of users with
higher interests than ui and in ascending order of their distances to user ui. Let Gk(ui)
be the user group of size k containing ui with the maximum ranking score calculated from
S′ = {u1, u2, · · · , um}. If ||ui um+1|| ≥ D(Gk(ui)), Gk(ui) is the user group of size k
containing ui with the maximum ranking score calculated from S.
Proof. Suppose we can find a group G′k(ui) of size k containing ui from S′′ = {u1, u2, · · · ,
um, · · · , um+j}, where 1 ≤ j ≤ n − m, such that rank_q(G′k(ui)) > rank_q(Gk(ui)).
Then we have ∃j (um+j ∈ G′k(ui)). Since ∀u ∈ S′′ (I(ui, q.T) ≤ I(u, q.T)), we have
I(G′k(ui), q.T) = I(Gk(ui), q.T) and derive D(G′k(ui)) < D(Gk(ui)). Since um+j ∈
G′k(ui), we have D(G′k(ui)) ≥ ||ui um+j||. Since S is sorted in ascending order of distance,
||ui um+j|| ≥ ||ui um+1|| ≥ D(Gk(ui)). Hence D(G′k(ui)) ≥ ||ui um+j|| ≥ D(Gk(ui)),
which contradicts D(G′k(ui)) < D(Gk(ui)) derived before. This completes the proof.
Example 3.2. We illustrate Theorem 3.4 in Figure 3.3. Let S = {u1, u2, u3, u4, u5} be a
sorted list of users with higher interests than ui and in ascending order of their distances
to user ui. Let G4(ui) = {ui, u1, u2, u3} be the user group of size 4 containing ui with the
maximum ranking score calculated from S′ = {u1, u2, u3}. The diameter is D(G4(ui)) =
||u1 u2||. Next, we consider u4 and have ||ui u4|| < D(G4(ui)). Hence, we obtain a new
group G4(ui) = {ui, u2, u3, u4} from S′′ = {u1, u2, u3, u4} with the maximum ranking
score and diameter D(G4(ui)) = ||ui u4||. We then consider u5 and have D(G4(ui)) <
||ui u5||. Theorem 3.4 guarantees that G4(ui) = {ui, u2, u3, u4} is the user group of size
4 containing ui with the maximum ranking score calculated from S.

Figure 3.3: Example of Theorem 3.4
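The early-stop rule of Theorem 3.4 can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions: users are plain 2-D points, every candidate is assumed to already have a higher interest than ui, keyword coverage is ignored, and the minimum-diameter group is found by brute force rather than by the backtracking method over the IR-tree. The names `diameter` and `min_diameter_group` are hypothetical and do not come from the thesis.

```python
from itertools import combinations
from math import dist


def diameter(points):
    """Largest pairwise Euclidean distance within a group of 2-D points."""
    return max((dist(a, b) for a, b in combinations(points, 2)), default=0.0)


def min_diameter_group(ui, candidates, k):
    """Scan candidates in ascending distance to ui and stop early (Theorem 3.4).

    Returns (best_group, number_of_candidates_examined).
    Assumes k >= 2 and that every candidate has a higher interest than ui.
    """
    ordered = sorted(candidates, key=lambda u: dist(ui, u))
    seen, best, best_d = [], None, float("inf")
    for examined, u in enumerate(ordered, start=1):
        # Early stop: u and all later users are at least best_d away from ui,
        # so any group that contains one of them has diameter >= best_d.
        if best is not None and dist(ui, u) >= best_d:
            return best, examined - 1
        seen.append(u)
        if len(seen) >= k - 1:
            # Only groups that contain the newly added user u can be new.
            for rest in combinations(seen[:-1], k - 2):
                group = (ui, u) + rest
                d = diameter(group)
                if d < best_d:
                    best, best_d = group, d
    return best, len(ordered)
```

In the spirit of Example 3.2, a later user can still replace a member of the current group as long as its distance to ui is smaller than the current diameter; once a user at least that far away is met, the scan stops without touching the remaining users.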
Diameter Constraint When retrieving user group Gk(ui), it is not necessary to consider
all the users whose interests are higher than that of ui. We propose a diameter constraint Dc
that can be used to prune the search space (Theorem 3.5) so that fewer users are considered
when selecting a group Gk(ui) with the maximum ranking score.
Lemma 3.2. rank_q(Gk(ui)) > rank_q(Gk(uj)) ⇐⇒ D(Gk(ui)) < Dc, where

\[
D_c = D_{max}\left(1 - \frac{rank_q(G_k(u_j)) - \alpha\,\frac{I(u_i, q.T)}{I_{max}(q.T)}}{1-\alpha}\right). \tag{3.2.7}
\]

The proof can be easily derived based on Equation 3.1.3 and is thus omitted.
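For completeness, the omitted step can be sketched as follows. The sketch assumes that Equation 3.1.3 has the form used in the proof of Theorem 3.5, that I(Gk(ui), q.T) = I(ui, q.T), and that 0 ≤ α < 1, so that dividing by (1 − α) preserves the direction of the inequality.

```latex
\begin{align*}
rank_q(G_k(u_i)) > rank_q(G_k(u_j))
 &\iff \alpha\,\frac{I(u_i, q.T)}{I_{max}(q.T)}
       + (1-\alpha)\left(1 - \frac{D(G_k(u_i))}{D_{max}}\right) > rank_q(G_k(u_j)) \\
 &\iff 1 - \frac{D(G_k(u_i))}{D_{max}}
       > \frac{rank_q(G_k(u_j)) - \alpha\,\frac{I(u_i, q.T)}{I_{max}(q.T)}}{1-\alpha} \\
 &\iff D(G_k(u_i))
       < D_{max}\left(1 - \frac{rank_q(G_k(u_j)) - \alpha\,\frac{I(u_i, q.T)}{I_{max}(q.T)}}{1-\alpha}\right) = D_c .
\end{align*}
```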
Theorem 3.5. If rank_q(Gk(ui)) > rank_q(Gk(uj)) and ||ui um|| ≥ Dc, we have um ∉
Gk(ui).

Proof. We prove it by contradiction. Suppose um ∈ Gk(ui). Since ||ui um|| ≤ D(Gk(ui)),
we have

\[
\begin{aligned}
rank_q(G_k(u_i)) &= \alpha\,\frac{I(u_i, q.T)}{I_{max}(q.T)} + (1-\alpha)\left(1 - \frac{D(G_k(u_i))}{D_{max}}\right) \\
 &\le \alpha\,\frac{I(u_i, q.T)}{I_{max}(q.T)} + (1-\alpha)\left(1 - \frac{\|u_i\,u_m\|}{D_{max}}\right) \\
 &\le \alpha\,\frac{I(u_i, q.T)}{I_{max}(q.T)} + (1-\alpha)\left(1 - \frac{D_c}{D_{max}}\right) \\
 &= rank_q(G_k(u_j)).
\end{aligned}
\]

It contradicts the condition rank_q(Gk(ui)) > rank_q(Gk(uj)).
Lemma 3.2 indicates that a group with smaller interest can have a higher ranking score
only if its diameter is small enough. Based on Lemma 3.2, Theorem 3.5 guarantees that
the users whose distance to ui is no smaller than Dc can be pruned, since it
is impossible to construct a group that contains any of those users and has a higher ranking
score than the candidate group (illustrated by Example 3.3). Hence, the search space is
pruned (line 18 in Algorithm 2). Only a group with a higher ranking score than the
candidate group is returned (line 8 in Algorithm 2). The value of Dc is updated when a user
group with higher ranking score is found (line 7 in Algorithm 1).
In order to facilitate our example description, we set q.T = ‘coffee’, and the value of
user’s interest in q.T is shown in Figure 3.2(b).
Example 3.3. Figure 3.2(a) shows the location layout of the users appearing in the IR-
tree of Figure 3.2(b). Let Dmax = 100, α = 0.5, Imax = 1.0, I(u1, q.T ) = 0.4, and the
current maximum ranking score be 0.6. Suppose the current processing group is Gk(u1).
Based on Lemma 3.2, we can obtain Dc = 20. Thus the diameter of Gk(u1) should be
less than 20. Figure 3.4 shows the distances between u1 and IR-tree nodes or its neighbor
users. With the diameter constraint Dc, we do not need to consider tree nodes {R3} and
users {u6, u7, u10, u13} during the query processing.
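The arithmetic of Example 3.3 can be checked with a small Python sketch. It assumes the ranking function has the form used in the proof of Theorem 3.5; the function names `rank` and `diameter_constraint` are hypothetical and chosen only for this illustration.

```python
def rank(alpha, interest, i_max, diam, d_max):
    """Ranking score in the form used in the proof of Theorem 3.5."""
    return alpha * interest / i_max + (1 - alpha) * (1 - diam / d_max)


def diameter_constraint(alpha, interest, i_max, d_max, rank_to_beat):
    """Dc from Equation 3.2.7: the diameter below which the group beats rank_to_beat."""
    if not 0 <= alpha < 1:
        raise ValueError("alpha must lie in [0, 1) for Dc to be defined")
    return d_max * (1 - (rank_to_beat - alpha * interest / i_max) / (1 - alpha))


# Values from Example 3.3: Dmax = 100, alpha = 0.5, Imax = 1.0,
# I(u1, q.T) = 0.4, current maximum ranking score = 0.6.
dc = diameter_constraint(alpha=0.5, interest=0.4, i_max=1.0, d_max=100.0, rank_to_beat=0.6)
```

Up to floating-point rounding, `dc` evaluates to 20, and a group with interest 0.4 whose diameter equals `dc` has a ranking score of exactly the current maximum, 0.6, so only strictly smaller diameters can improve on the candidate.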
Figure 3.4: Distance between u1 and its neighbors
3.2.4 Diameter Oriented Algorithm
The Diameter Oriented Algorithm (DOAIR) classifies groups in terms of group diameters. Let
set Gk contain all possible user groups of size k. Let set Gk(ui, ·) cover all the groups of
size k that take user ui as one end of the group diameter. Obviously, ∪ui∈U Gk(ui, ·) = Gk.
Note that Gk(ui, ·) may be an empty set. Algorithm DOAIR follows the descending order
of the user interest and constructs the group Gk(ui, ·) with the maximum ranking score in
Gk(ui, ·). If the ranking score of the currently found group Gk(ui, ·) is higher than the upper
bound on the ranking score of the next group Gk(ui+1, ·) (termination condition), the
currently found group is returned as the result. The correctness of the termination condition
is guaranteed by Lemma 3.3. The correctness of algorithm DOAIR is guaranteed by
Theorem 3.6.
Lemma 3.3. Let S = {u1, u2, · · · , un} be a sorted list of users in descending order
of their interests. If rank_q(Gk(ui, ·)) > rank^u_q(Gk(uj, ·)), where rank^u_q(Gk(uj, ·)) =
rank^u_q(Gk(uj)) (Equation 3.2.6), we have rank_q(Gk(ui, ·)) > rank_q(Gk(um, ·)), where
k ≤ i < j ≤ m ≤ n.

Proof. For j ≤ m ≤ n, we have I(uj, q.T) ≥ I(um, q.T). According to Equation 3.2.6,
we derive rank^u_q(Gk(uj, ·)) ≥ rank^u_q(Gk(um, ·)). Hence, we get rank_q(Gk(ui, ·)) >
rank^u_q(Gk(uj, ·)) ≥ rank^u_q(Gk(um, ·)) ≥ rank_q(Gk(um, ·)) based on Theorem 3.2.
Theorem 3.6. Algorithm DOAIR finds the correct answer to an SIG query.

Proof. We prove it by contradiction. Assume that given an SIG query q, algorithm
DOAIR returns G as the result. Now suppose there exists G′ with the maximum ranking
score such that rank_q(G) < rank_q(G′). Suppose G and G′ are the groups with
maximum ranking score in Gk(ui, ·) and Gk(u′i, ·), respectively. There are three possible
cases. (1) If I(ui, q.T) < I(u′i, q.T), algorithm DOAIR first considers Gk(u′i, ·), and
then Gk(ui, ·). According to Lemma 3.3, we have rank_q(G′) ≥ rank^u_q(G) ≥ rank_q(G).
Thus, algorithm DOAIR must return the group G′ in Gk(u′i, ·), not G in Gk(ui, ·). (2)
If I(ui, q.T) = I(u′i, q.T), we have Gk(u′i, ·) = Gk(ui, ·). Algorithm DOAIR must return
the group G′ since rank_q(G) < rank_q(G′). (3) If I(ui, q.T) > I(u′i, q.T), algorithm
DOAIR first considers Gk(ui, ·), and then Gk(u′i, ·). According to Lemma 3.3,
we have rank_q(G) ≥ rank^u_q(G′) ≥ rank_q(G′), which contradicts the assumption that
rank_q(G) < rank_q(G′). Hence, the correctness of algorithm DOAIR is proved.
Algorithm 3 shows the pseudo code of the DOAIR algorithm. The candidate group Gk
is initialized as the k-sized user group with the minimum diameter (line 1). The algorithm
processes users in descending order of their interests for the query keywords (line 3) by calling
function GetNextUser that adopts the Threshold Algorithm [75]. For the currently obtained
user ui, function DOAIRGetNextGroup constructs a group of size k with the maximum
ranking score, taking user ui as one end of the group diameter, denoted as Gk(ui, ·) (line
7). The constructed group is assigned as the candidate group if its ranking score is higher
than that of the candidate group Gk (lines 8 and 9). If the
ranking score of the candidate group is higher than the upper bound of Gk(ui+1, ·), the
algorithm returns the candidate group as the result and terminates (lines 10 and 11).
Algorithm DOAIR is able to skip the construction of group Gk(ui, ·) if the distance Dui
between ui and its nearest neighbor is larger than Dc and the candidate group interest is
also higher than ui's interest (lines 5 and 6), meaning that it is impossible to find a group
taking ui as one end of the group diameter with a higher ranking score than the candidate
group. Theorem 3.7 guarantees the correctness of this pruning step.
Algorithm 3 DOAIR(Integer k, Keywords T, InvertedFile invf, IRTree irtree)
 1: Result Gk ← the k-sized user group with the minimum diameter;
 2: Dc ← ∞;
 3: while ui, ui+1 ← GetNextUser(T, invf) do
 4:     Update Dc according to Equation 3.2.7;
 5:     if Dui > Dc then
 6:         Continue;
 7:     Gk(ui, ·) ← DOAIRGetNextGroup(irtree, ui, k, Dc, T);
 8:     if rank_q(Gk(ui, ·)) > rank_q(Gk) then
 9:         Gk ← Gk(ui, ·);
10:     if rank_q(Gk) > rank^u_q(Gk(ui+1, ·)) then
11:         Return Gk;
12: Return Gk;

In order to find group Gk(ui, ·) with the maximum ranking score in Gk(ui, ·), function
DOAIRGetNextGroup (Algorithm 4) uses the IR-tree to retrieve the users in ascending
order of their distances to ui (line 19). For an encountered user e, it tries to construct a
group Gk of size k with diameter ui e (lines 9 and 10). To avoid enumerating all possible
diameters ui e and to find group Gk(ui, ·) efficiently, an early stop condition (line
12) and two interest constraints (lines 14 and 18) are designed. The diameter constraint
(Theorem 3.5) is also applied here (line 18).
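Theorem 3.7. Let Gk be the candidate group and Dui be the distance between ui and its
nearest neighbor. If Dui > Dc, we have rank_q(Gk) > rank_q(Gk(ui, ·)), where

\[
D_c = D_{max}\left(1 - \frac{rank_q(G_k) - \alpha\,\frac{I(u_i, q.T)}{I_{max}(q.T)}}{1-\alpha}\right).
\]

Proof. We derive rank_q(Gk) > rank_q(Gk(ui, ·)) as follows:

\[
\begin{aligned}
rank_q(G_k(u_i, \cdot)) &= \alpha\,\frac{I(G_k(u_i, \cdot), q.T)}{I_{max}(q.T)} + (1-\alpha)\left(1 - \frac{D(G_k(u_i, \cdot))}{D_{max}}\right) \\
 &\le \alpha\,\frac{I(u_i, q.T)}{I_{max}(q.T)} + (1-\alpha)\left(1 - \frac{D_{u_i}}{D_{max}}\right) \\
 &< \alpha\,\frac{I(u_i, q.T)}{I_{max}(q.T)} + (1-\alpha)\left(1 - \frac{D_c}{D_{max}}\right) \\
 &= rank_q(G_k).
\end{aligned}
\]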
Early Stop Function DOAIRGetNextGroup considers the users in ascending order of
their distances to ui. Let Gk be the currently found group with diameter ui e. If the interest
of Gk equals the interest of ui, group Gk is the group with the maximum score in Gk(ui, ·).
The remaining users, which are farther than e from ui, do not need to be considered. The
correctness is guaranteed by Theorem 3.8.
Algorithm 4 DOAIRGetNextGroup(IRTree irtree, User ui, Integer k, Double Dc, Keywords T)
 1: Queue ← NewPriorityQueue();
 2: Queue.Enqueue(irtree.root, 0);
 3: Ic ← 0;
 4: Add ui to G′k;
 5: while Queue is not empty do
 6:     Entry e ← Queue.Dequeue();
 7:     if e refers to a user then
 8:         Add e to G′k;
 9:         if G′k contains more than k users then
10:             Gk ← GetCurrentResult(G′k, ui, e, T);
11:             if Gk is not empty then
12:                 if the interest of Gk = the interest of ui then
13:                     Return Gk;
14:                 Update Queue and G′k, delete the users whose interest ≤ the interest of Gk (Theorem 3.10);
15:     else
16:         for each entry e′ in the node pointed to by e do
17:             Update Ic according to Theorem 3.9;
18:             if the interest of e′ > Ic ∧ ||e′ ui|| < Dc then
19:                 Queue.Enqueue(e′, ||ui e′||);
20: Return Gk;
Theorem 3.8. Let S = {u1, u2, · · · , um, um+1, · · · , un} be a sorted list of users in
ascending order of their distances to user ui. Let Gk(ui, um) be the user group with
diameter ui um and rank_q(Gk(ui, um)) > rank_q(Gk(ui, uj)) where 1 ≤ j < m. If
I(Gk(ui, um), q.T) = I(ui, q.T), then rank_q(Gk(ui, um)) > rank_q(Gk(ui, uj)) where 1 ≤
j ≤ n, j ≠ m.

Proof. Suppose we can find a group Gk(ui, uj) with diameter ui uj, where m < j ≤ n,
such that rank_q(Gk(ui, um)) < rank_q(Gk(ui, uj)). Since D(Gk(ui, um)) = ||ui um|| <
||ui uj|| = D(Gk(ui, uj)) and I(Gk(ui, um), q.T) = I(ui, q.T) ≥ I(Gk(ui, uj), q.T), we
can derive rank_q(Gk(ui, um)) > rank_q(Gk(ui, uj)), which contradicts the assumption.
This completes the proof.
Interest Constraint Ic In function IOAIRGetNextGroup, when retrieving user group
Gk(ui), a distance constraint Dc is used to prune the search space. Besides Dc, function
DOAIRGetNextGroup uses an interest constraint Ic to further prune the search space
so that selecting a group Gk(ui, ·) with the maximum ranking score is more efficient.
Specifically, if the interest of a user e is lower than Ic, the ranking score of the group with
diameter ui e is lower than that of the currently found candidate group.
Lemma 3.4. Suppose um and un are the mth and nth nearest neighbors of ui, where m <
n. We have rank_q(Gk(ui, um)) < rank_q(Gk(ui, un)) ⇐⇒ I(Gk(ui, un), q.T) > Ic,
where

\[
I_c = \frac{I_{max}(q.T)}{\alpha}\left(rank_q(G_k(u_i, u_m)) - (1-\alpha)\left(1 - \frac{\|u_i\,u_n\|}{D_{max}}\right)\right). \tag{3.2.8}
\]

The proof is trivial and thus omitted (easily derived based on Equation 3.1.3).

Theorem 3.9. Let um and un be the mth and nth nearest neighbors of ui, where m < n.
Let Gk(ui, um) be the currently found group with maximum ranking score. If I(un, q.T) <
Ic, we have rank_q(Gk(ui, um)) > rank_q(Gk(ui, un)).

Proof. We prove it by contradiction. Assume rank_q(Gk(ui, um)) ≤ rank_q(Gk(ui, un)).
Then we have Ic > I(un, q.T) ≥ I(Gk(ui, un), q.T), which contradicts Lemma 3.4.
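The interest constraint of Equation 3.2.8 can be illustrated with a short Python sketch. The function names and the numeric values below are hypothetical and chosen only for illustration; the ranking function is assumed to have the form used in the proof of Theorem 3.5, and α is assumed to be strictly positive so that Ic is defined.

```python
def interest_constraint(alpha, i_max, d_max, best_rank, dist_to_un):
    """Ic from Equation 3.2.8 for a prospective diameter ||ui un|| = dist_to_un."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1] for Ic to be defined")
    return (i_max / alpha) * (best_rank - (1 - alpha) * (1 - dist_to_un / d_max))


def can_prune(user_interest, ic):
    """Theorem 3.9: a user whose interest is below Ic cannot yield a better group."""
    return user_interest < ic


# Hypothetical setting: alpha = 0.5, Imax = 1.0, Dmax = 100,
# current best ranking score 0.6, and a prospective diameter of 40.
ic = interest_constraint(alpha=0.5, i_max=1.0, d_max=100.0, best_rank=0.6, dist_to_un=40.0)
```

With these values `ic` is 0.6 up to rounding: a group with diameter 40 needs a group interest above 0.6 to beat the current score, so a user with interest 0.5 at that distance can be skipped, whereas one with interest 0.7 cannot.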
Interest Constraint IG Let Gk(ui, um) be the currently found candidate group and IG =
I(Gk(ui, um), q.T). When constructing group Gk(ui, un) where ||ui um|| < ||ui un||, the
users whose interest is no higher than IG can be pruned. In other words, if group Gk(ui, un)
is successfully constructed such that rank_q(Gk(ui, um)) < rank_q(Gk(ui, un)), the users
whose interest is no higher than IG must not belong to group Gk(ui, un). This constraint is
used to prune the search space when constructing a group with a specific diameter. The
correctness is guaranteed by Theorem 3.10.

Theorem 3.10. Let um and un be the mth and nth nearest neighbors of ui, where m < n.
If rank_q(Gk(ui, um)) < rank_q(Gk(ui, un)), then ∀uj (I(uj, q.T) ≤ IG =⇒ uj ∉ Gk(ui, un)).

Proof. Assume ∃uj (I(uj, q.T) ≤ IG ∧ uj ∈ Gk(ui, un)). Since D(Gk(ui, um)) = ||ui um|| <
||ui un|| = D(Gk(ui, un)) and I(Gk(ui, un), q.T) ≤ I(uj, q.T) ≤ IG = I(Gk(ui, um), q.T),
we have rank_q(Gk(ui, um)) ≥ rank_q(Gk(ui, un)), which results in a contradiction.
Given two users ui and e, function GetCurrentResult (Algorithm 5) is invoked by function
DOAIRGetNextGroup (Algorithm 4) to construct group Gk(ui, e) with the maximum
ranking score. Based on Lemma 3.5, function GetCurrentResult first constructs C(uie)
to minimize the search space of Gk(ui, e) (line 1, illustrated by Example 3.4). If the size
of C(uie) is less than k − 2 or C(uie) cannot fully cover T, it returns NULL (lines 2–3),
because it is impossible to form a group Gk(ui, e) with fewer than k − 2 such users. Otherwise,
we use the line segment uie to partition the search space C(uie) into two sets GL and GR
(lines 4–5). The users from GL and GR whose interests are no less than Imin{ui, e} (the
minimum interest of ui and e) are put into GLU and GRU, respectively (lines 6–7). If there
are no fewer than k − 2 users in GLU or GRU, we randomly select k − 2 users from that set
such that their union with {ui, e} fully covers T, and return them together with {ui, e} as
the result (lines 8–10, illustrated by Example 3.5). The correctness is guaranteed by
Lemma 3.6. Otherwise, we split the users of C(uie) into two sets Gup and Gdown according
to Imin{ui, e} (lines 11–12). Gup is the set of users whose interest is no less than
Imin{ui, e}, and Gdown is the set of users whose interest is less than Imin{ui, e}.
Afterwards, we iteratively construct Gk(ui, e), admitting the users in Gdown in decreasing
order of interest (lines 13–20). In other words, we prefer to search for the group with high
interest, because a higher interest means a higher ranking score for group Gk(ui, e) when
the diameter uie is known. In each iteration, if there exist k − 2 users in Gup such that the
distance between any pair of them is no more than ||ui e|| and their union with {ui, e}
fully covers T (found by enumerating all possible groups of size k − 2), Gk ∪ {ui, e} is
returned as the result (lines 15–17). Otherwise, the user p with the highest interest in
Gdown is moved into Gup (lines 18–19). This process is repeated until all users in
Gdown have been checked.
Lemma 3.5. Let C(ui, uiuj) and C(uj, uiuj) be two circles centered at ui and uj with
the same radius ||ui uj||. The intersection of C(ui, uiuj) and C(uj, uiuj) is denoted by
C(uiuj). We have ∀u ∈ Gk(ui, uj) (u ∈ C(uiuj)).
The proof is obvious and thus omitted.
Example 3.4. Figure 3.5 shows the two circles C(u1, u1u11) and C(u11, u1u11) centered
at u1 and u11 with radius ||u1 u11||. The intersection C(u1u11) covers 3 users (i.e., u3, u4,
and u9). Group Gk(u1, u11) can be constructed from the users inside C(u1u11). In other
words, the search space of Gk(u1, u11) is C(u1u11).

Algorithm 5 GetCurrentResult(Group G′k, User ui, User e, Keywords T)
 1: C(uie) ← Users from {u | u ∈ G′k ∧ ||u e|| ≤ ||ui e||};
 2: if |C(uie)| < k − 2 or C(uie) cannot fully cover T then
 3:     Return NULL;
 4: GL ← Users from C(uie) that are above the line segment uie;
 5: GR ← Users from C(uie) that are below the line segment uie;
 6: GLU ← Users from GL whose interest is no less than Imin{ui, e};
 7: GRU ← Users from GR whose interest is no less than Imin{ui, e};
 8: if |GLU| ≥ k − 2 or |GRU| ≥ k − 2 then
 9:     Gk ← Select any k − 2 users from GLU or GRU such that their union with {ui, e} fully covers T;
10:     Return Gk ∪ {ui, e};
11: Gup ← GLU ∪ GRU;
12: Gdown ← C(uie) − Gup;
13: Queue ← Sort the users in Gdown according to their interest in descending order;
14: repeat
15:     Gk ← k − 2 users from Gup such that the distance between any pair of users is no more than ||ui e|| and their union with {ui, e} fully covers T;
16:     if Gk is not empty then
17:         Return Gk ∪ {ui, e};
18:     User p ← Queue.Dequeue();
19:     Gup ← Gup ∪ {p};
20: until Queue is empty
21: Return NULL;
Lemma 3.6. If the number of users on one side s of diameter uiuj inside C(uiuj) is no
less than k − 2 and their interest is no smaller than the minimum interest of ui and uj,
group Gk(ui, uj) can be constructed by randomly selecting k − 2 users from s and adding
ui and uj.

Proof. Since the k − 2 users are selected from s, the distance between any pair of users in
Gk(ui, uj) is no larger than the diameter ||ui uj||. Hence, D(Gk(ui, uj)) = ||ui uj|| and
I(Gk(ui, uj), q.T) = min{I(ui, q.T), I(uj, q.T)}. The ranking score of Gk(ui, uj) is
maximized.
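The geometric filtering behind Lemmas 3.5 and 3.6 can be sketched in Python as follows. This is a simplified illustration, not the thesis implementation: users are plain 2-D points, interest values and keyword coverage are left out, and points lying exactly on the line through ui and uj are assigned to the first side by convention. The function names are hypothetical.

```python
from math import dist


def lens_candidates(ui, uj, users):
    """Users inside C(ui uj): within distance ||ui uj|| of both ui and uj (Lemma 3.5)."""
    r = dist(ui, uj)
    return [u for u in users
            if u != ui and u != uj and dist(u, ui) <= r and dist(u, uj) <= r]


def split_by_side(ui, uj, users):
    """Partition users by which side of the line through ui and uj they lie on.

    The side is the sign of the 2-D cross product (uj - ui) x (u - ui);
    points on the line itself go to the first list (an arbitrary convention).
    """
    left, right = [], []
    for u in users:
        cross = (uj[0] - ui[0]) * (u[1] - ui[1]) - (uj[1] - ui[1]) * (u[0] - ui[0])
        (left if cross >= 0 else right).append(u)
    return left, right
```

Any two users taken from the same side of the lens are at most ||ui uj|| apart, which is why Lemma 3.6 allows k − 2 of them to be picked without checking pairwise distances.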
Example 3.5. Consider the 13 users of the IR-tree in Figure 3.5. Based on Lemma 3.5, the
search space of G4(u1, u11) is C(u1u11), which contains 2 users {u3, u4} whose interest is
no less than 0.4 (the minimum interest of u1 and u11). Based on Lemma 3.6, since u3 and
u4 are on one side of the diameter u1u11 and 2 ≥ 4 − 2, {u3, u4} ∪ {u1, u11} is returned
as G4(u1, u11) with the maximum ranking score.

Figure 3.5: Constructing G4(u1, u11)
3.3 Performance Evaluation
This section describes the experiments used to evaluate the algorithms proposed for the
processing of SIG queries (i.e., IOAIR and DOAIR). We also consider a baseline algo-
rithm that is similar to algorithm IOAIR, but using the traditional R-tree index without
diameter constraint. We introduce the datasets and queries used in Section 3.3.1 and the
experiment setup in Section 3.3.2. The experimental results are presented in Section 3.3.3.
3.3.1 Datasets and Queries
We collect data from two popular location-based social networks in China, i.e., Jiepang
(http://www.jiepang.com) and Dianping (http://www.dianping.com). Jiepang provides a
check-in service for visitors, who may check in at the tourist places they like. Dianping
provides a check-in service for users to share review comments on the POIs, such as
restaurants, that they prefer. The properties of these two real datasets are shown in
Table 3.3 below.

Table 3.3: Dataset Properties
                                        Jiepang       Dianping
Total # of users                        353,493       2,053,214
Total # of spatial objects              244,331       1,466,188
Total # of check-in actions             5,250,466     17,527,599
Total # of unique tags                  2,101         153,211
Average # of tags per spatial object    2             23
Average # of tags per user interest     1.3           37

We randomly generate two query sets on the two datasets. The query set on Jiepang
contains 200 queries, while the query set on Dianping contains 500 queries. Each query
contains several keywords and the specified group size. The keywords are randomly
generated from the tag set of the dataset. The number of keywords varies from 1 to 5. The
group size k takes values in {20, 40, 60, 80, 100}.
3.3.2 Setup
The indexes used in this chapter, including the R-tree, the inverted file, and the IR-tree,
are disk resident. The page size is set to 4KB. The fanouts of the R-tree and the IR-tree
are both set to 100. All the algorithms are implemented in the Java programming language.
The experiments are run on an Intel Core 2 Quad 2.4 GHz processor with 4GB of
DDR3 memory. The default values of k, α, and the number of query tags
are 50, 0.5, and 1, respectively.
3.3.3 Experimental Results
We evaluate the performance of the three algorithms when varying the value of parameter
k, α, the number of keywords, and the buffer size. We also test the scalability of the pro-
posed algorithms on the two different datasets. As in many other performance evaluation
on query processing, we report the overall performance using the average elapsed time
and the average I/O cost.
Varying group size k. In this experiment, we evaluate the performance of our proposed
algorithms when varying the group size k. Figure 3.6 shows the average elapsed time and the
simulated I/O cost on the Jiepang and Dianping datasets. The IOAIR and DOAIR algorithms
outperform the baseline approach for all values of k in terms of both metrics, since
the IR-tree is able to prune irrelevant leaf nodes whose interest is less than the current
interest constraint as early as possible. Notably, the algorithm DOAIR achieves much
better performance than IOAIR. This is because Theorems 3.9 and 3.10 effectively prune
a significant amount of search space and Lemmas 3.5 and 3.6 help to reduce the distance
computation time for DOAIR.

Figure 3.6: Varying k. Panels: (a) Varying k on Jiepang; (b) Varying k on Dianping. Each
panel plots elapsed time (milliseconds) and page accesses against k for Baseline, IOAIR,
and DOAIR.
Varying α. Parameter α is used to balance the group interest and the group diameter.
Users can adjust α to bias the query results towards interest or diameter. Figure 3.7
shows the performance of the three algorithms with different values of α. As discussed
in Section 3.2, IOAIR is slightly biased towards finding the k-size maximum interest group
with higher group interest, while DOAIR favours searching for the maximum interest
group with a smaller group diameter. When α is varied from 0.1 to 0.9, the average
elapsed time and the average simulated I/O cost of DOAIR on both datasets increase,
whereas those of IOAIR decrease. Owing to the advantages of DOAIR presented
above, DOAIR still has better overall performance than IOAIR. However,
when α is high and the group size k is small, IOAIR achieves better performance than
DOAIR (see Figure 3.8). The reason is twofold. First, IOAIR gives priority to
groups with a high group interest. Thus, with a large α value, IOAIR can find the
final results quickly and terminate the query processing early. Second, as discussed above,
the efficiency of DOAIR is largely attributed to its strong pruning ability, which
reduces the distance computation cost (based on Lemmas 3.5 and 3.6). When k is small,
the distance computation cost is not very high, thereby weakening the pruning effect of
DOAIR. Combining these two effects, IOAIR outperforms DOAIR when α = 0.9 and
k = 10.

Figure 3.7: Varying α. Panels: (a) Varying α on Jiepang; (b) Varying α on Dianping. Each
panel plots elapsed time (milliseconds) and page accesses against α for Baseline, IOAIR,
and DOAIR.
Figure 3.8: Varying k on Dianping (α = 0.9). The plot shows elapsed time (milliseconds)
against k for Baseline, IOAIR, and DOAIR.

Varying the number of query tags. In this experiment, we evaluate the performance
of our proposed algorithms by varying the number of query tags. In most cases, the
number of query tags is small; thus we only consider the cases where the number varies
from 1 to 5. In Figure 3.9, we can see that DOAIR again demonstrates the best
performance in all cases tested. As no user interest information is integrated into the
R-tree, the baseline algorithm cannot prune the irrelevant tree nodes whose interest does not
satisfy the interest constraint; thus the baseline algorithm performs the worst in running
time and I/O cost. As discussed earlier, DOAIR shows better performance than IOAIR
due to its stronger pruning power and reduced distance computation cost.
Varying buffer size. To some extent, the buffer size affects the algorithm performance.
The larger the buffer in memory, the more disk pages are buffered, and thus
the less I/O cost is incurred. In this experiment, we adopt the LRU (Least Recently Used)
buffering strategy to cache the disk pages. Figure 3.10 shows that DOAIR outperforms
the other two algorithms in all settings. As the buffer size increases, the
average I/O cost decreases notably, as expected. The average elapsed time also decreases,
but to a lesser degree. This is because the most time-consuming
part is computing the SIG groups after the irrelevant tree nodes are pruned.
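The effect of an LRU buffer on the simulated I/O cost can be sketched as follows. This is a minimal illustration of the buffering strategy, not the buffer manager used in the experiments; the class name `LRUPageBuffer`, the `load_page` callback, and the idea of counting one page access per buffer miss are assumptions made for this sketch.

```python
from collections import OrderedDict


class LRUPageBuffer:
    """Fixed-capacity page buffer that evicts the least recently used page."""

    def __init__(self, capacity):
        self.capacity = capacity      # number of pages the buffer can hold
        self.pages = OrderedDict()    # page_id -> page content, oldest first
        self.disk_reads = 0           # simulated I/O cost (buffer misses)

    def read(self, page_id, load_page):
        """Return the page, loading it through load_page(page_id) on a miss."""
        if page_id in self.pages:
            self.pages.move_to_end(page_id)   # mark as most recently used
            return self.pages[page_id]
        self.disk_reads += 1                  # miss: one simulated page access
        data = load_page(page_id)
        if self.capacity > 0:
            self.pages[page_id] = data
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)  # evict least recently used page
        return data
```

Replaying the same sequence of page requests against buffers of increasing capacity gives a non-increasing number of simulated disk reads, which matches the trend described above for the I/O cost.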
Figure 3.9: Varying the number of query tags. Panels: (a) Varying the number of query
tags on Jiepang; (b) Varying the number of query tags on Dianping. Each panel plots
elapsed time (milliseconds) and page accesses against the number of keywords for Baseline,
IOAIR, and DOAIR.

Figure 3.10: Varying Buffer Size. Panels: (a) Varying buffer size on Jiepang; (b) Varying
buffer size on Dianping. Each panel plots elapsed time (milliseconds) and page accesses
against the buffer size for Baseline, IOAIR, and DOAIR.

Figure 3.11: Varying the Number of Users. Panels: (a) Varying the Number of Users on
Jiepang; (b) Varying the Number of Users on Dianping. Each panel plots elapsed time
(milliseconds) and page accesses against the number of users for Baseline, IOAIR, and
DOAIR.
Varying the number of users. With the purpose of testing the scalability of our proposed
algorithms, in this set of experiments we vary the number of users in the two testing
datasets. As shown in Figure 3.11, our proposed algorithms DOAIR and IOAIR exhibit
good scalability performance. As the number of users grows, the average elapsed time
and the average simulated I/O cost of these algorithms on both datasets increase more
slowly than the baseline algorithm, resulting in better performance improvement for larger
datasets.
3.4 Summary
In this chapter, we have presented a new SIG query that considers both the users’ spatial
locations and their common interest in query keywords. We have proposed a family of
efficient algorithms based on the IR-tree, namely IOAIR and DOAIR, for the efficient
processing of SIG queries. IOAIR processes SIG queries based on the descending order
of interest to search the result group with the minimum diameter. IOAIR integrates the
distance constraint into the query optimization to prune search space. In contrast, DOAIR
adopts a diameter-oriented strategy to process SIG queries, which takes into account the
interest and diameter order simultaneously. Effective pruning techniques have been de-
veloped to prune irrelevant search space and accelerate the search speed. The experiments
based on two real datasets demonstrate that the DOAIR algorithm achieves the best per-
formance and outperforms the baseline algorithm by orders of magnitude.
Chapter 4
Geo-Social K-Cover Group Queries for
Collaborative Spatial Computing
The emergence of geo-social data has enriched the studies on queries in location-based
social networks. As mentioned in Chapter 1, the geo-social group query is one of the most
important problems for collaborative spatial computing. In this chapter, we propose a
new type of geo-social queries, namely geo-social k-cover group (GSKCG) queries, by
considering the users’ spatial containment and their social connections. The rest of this
chapter is organized as follows. Section 4.1 formally formulates the problem and an-
alyzes its complexity. Section 4.2 presents the KCGFinder algorithm along with a set
of effective pruning techniques. Section 4.3 presents the Enhanced SaR-tree structure
and introduces the integrated SaRBasedKCGFinder algorithm. Experimental results are
provided in Section 4.4. Finally, we summarize this chapter in Section 4.5.
4.1 Problem Formulation
In this section, we give some preliminaries and provide the problem statement, followed
by an example to elaborate the problem defined. Table 4.1 summarizes the notations used
throughout this chapter.
A GSKCG query is defined over a location-based social network (LBSN) G = (V,E),
where each vertex u ∈ V is a user and each edge e ∈ E denotes an acquainted relation
Table 4.1: Summary of notations
Notation      Definition
G = (V,E)     Location-based social network (LBSN)
u, v          a user of G
u.R           an associated region of user u
Q = (k, P)    a GSKCG query Q, where P is a set of query points and k indicates a social constraint k-core
G[V′]         the subgraph of G induced by the vertex set V′
SI            an intermediate solution where |SI| ≤ s
SU            the set of remaining users
P_S           the set of query points covered by users in S
NB_S(v)       the number of v's neighbors in S
V_p           a set of users whose associated region covers query point p
U_k           a set of users that may appear in a k-core
C_s           a group of size s
C^k           a connected k-core component
C^k_s         a size-s group where G[C^k_s] is a k-core and C^k_s fully covers P
M             the maximum group size of a GSKCG query
ListP         a sorted user list according to the increasing size of V_p where p ∈ P
A(p)          the index of the last user u in ListP where p ∈ u.R
k(u)          a k-core with u inside
CBR_u,k       a rectangle that does not contain a k(u)
iCBR_u,k      an internal CBR of u that does not contain a k(u)
eCBR_u,k      an external CBR of u that does not contain a k(u)
MBP(P)        a minimum bounding rectangle which contains the query point set P
between the two users it connects. For any two users u, v ∈ V , there exists an edge
(u, v) ∈ E if and only if u and v are familiar with each other. Moreover, each user
u ∈ V has an associated region denoted by u.R.¹ Such an LBSN can be easily derived by
combining the location and social data collected from real-life applications.
A GSKCG query aims to find a group of users with a desired social relationship. In
this chapter, we quantify the desired social relationship within a user group in terms
of k-core [55], a widely used model for detecting community structures in a graph.
Definition 4.1. (k-core) For a graph G = (V,E), a connected subgraph G′ = (V ′, E ′)
of G is a k-core if every vertex v ∈ V ′ has degree at least k in G′.
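As a minimal sketch of Definition 4.1, the check below tests whether the subgraph induced by a user group is connected and has minimum degree at least k. The adjacency-set representation, the function name `is_k_core`, and the toy graph are assumptions made for illustration only.

```python
from collections import deque

def is_k_core(adj, group, k):
    """Return True if the subgraph induced by `group` is connected and
    every member has at least k neighbors inside the group."""
    group = set(group)
    if not group:
        return False
    # Minimum-degree test, counting only neighbors inside the group.
    if any(len(adj[u] & group) < k for u in group):
        return False
    # Connectivity test by breadth-first search restricted to the group.
    start = next(iter(group))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u] & group:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == group

# Hypothetical toy graph: a triangle a-b-c plus a pendant vertex d.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
```

On this toy graph the triangle passes the test for k = 2, while adding the pendant vertex d breaks the degree constraint.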
We argue that k-core is a reasonable model to measure a user group’s social acquain-
tance level for two main reasons. First, the minimum degree constraint of k-core is an
important measure of group cohesiveness in social science research [55] and has been
widely adopted in the research of graph problems [15, 53, 71]. In our problem, k-core
is effective and flexible to capture a user group’s acquaintance level in real-life LBSNs.
Second, k-core decomposition has a linear time complexity, which makes it appealing in
real-life applications. Indeed, it has been used as an important social constraint in practical
applications [58]. Based on the k-core model, we formally define a GSKCG query
below.
¹ For ease of exposition, we consider each user to have one associated region. Our solution can be easily extended to the case where a user has multiple associated regions, as discussed later in Section 4.3.3.
Definition 4.2. (GSKCG query) Given an LBSN G = (V,E), a Geo-Social k-Cover
Group (GSKCG) query is defined as a 2-tuple Q = (k, P ), where k is a positive integer,
indicating the social acquaintance constraint, and P = {p1, p2, · · · , pm} is a set of query
points, indicating the spatial coverage constraint, and returns a set of users V ′ ⊆ V such
that:
1. P ⊆ ⋃_{u ∈ V ′} u.R,
2. the subgraph G[V ′] of G is a k-core, and
3. the cardinality of G[V ′] is minimum.
Note that we require the returned user group to have the minimum cardinality. This
requirement is naturally derived from the real-world demands. For example, in the mo-
tivating examples in Chapter 1, retrieving a minimum set of users normally leads to the
minimum employment cost or ease of reaching a consensus. We choose to make k an
input parameter in order to provide a generic geo-social query service for different ser-
vice requesters. For a service requester that aims to find a single user who covers all the
tasks, he/she can set k = 0, which allows the GSKCG query to consider only the spatial
containment constraint, but not the social constraint. In many other cases, setting k to a
non-zero value will provide much more flexibility for a requester. For example, a service
requester can issue multiple GSKCG queries with different k values in parallel and then
select a proper group that fits his/her business needs.
Example 4.1. Consider a simple LBSN G = (V,E) where the users’ acquaintance rela-
tions and associated regions are shown in Figure 4.1(a) and Figure 4.1(b), respectively.
The GSKCG query Q = (k, P ) with k = 2 and P = {p1, p2, p3, p4} returns the user group
V ′ = {u1, u3, u4} because: 1) the joint regions of users in V ′ can cover all the query
points in P ; 2) the subgraph G[V ′] of G is a 2-core; and 3) the cardinality of G[V ′] is
minimum among all user groups that satisfy the first two conditions.
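The three conditions of Definition 4.2 can be read directly as an exhaustive search, sketched below. The coverage sets follow Example 4.6; the edge list is an assumption standing in for Figure 4.1(a), chosen only so that it agrees with the worked examples in this chapter. All function names are illustrative.

```python
from itertools import combinations

# Coverage sets as listed in Example 4.6.
COVER = {"u1": {"p1"}, "u2": {"p2"}, "u3": {"p2", "p3"},
         "u4": {"p4"}, "u5": {"p3"}, "u6": {"p3"}}
# Assumed edges (the figure itself is not reproduced here).
EDGES = [("u1", "u2"), ("u1", "u3"), ("u1", "u4"),
         ("u3", "u4"), ("u2", "u5"), ("u5", "u6")]
ADJ = {u: set() for u in COVER}
for a, b in EDGES:
    ADJ[a].add(b)
    ADJ[b].add(a)

def is_connected(adj, group):
    """Depth-first search restricted to the members of `group`."""
    group = set(group)
    stack = [next(iter(group))]
    seen = set(stack)
    while stack:
        u = stack.pop()
        for v in (adj[u] & group) - seen:
            seen.add(v)
            stack.append(v)
    return seen == group

def is_valid_group(adj, cover, group, k, points):
    group = set(group)
    covered = set().union(*(cover[u] for u in group))              # Condition 1
    degree_ok = all(len(adj[u] & group) >= k for u in group)       # Condition 2
    return points <= covered and degree_ok and is_connected(adj, group)

def smallest_valid_group(adj, cover, k, points):
    users = sorted(adj)
    # Condition 3: try sizes in increasing order, starting at k + 1.
    for size in range(max(k + 1, 1), len(users) + 1):
        for group in combinations(users, size):
            if is_valid_group(adj, cover, group, k, points):
                return set(group)
    return None
```

Under the assumed edge list, the call `smallest_valid_group(ADJ, COVER, 2, {"p1", "p2", "p3", "p4"})` yields the group {u1, u3, u4} of Example 4.1. The search is exponential in the number of users, which is what the pruning techniques of Section 4.2 aim to mitigate.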
Figure 4.1: An example of a location-based social network for GSKCG query. (a) Social networks; (b) Associated regions.
As formally defined in Definition 4.2, a GSKCG query finds a set of users that satisfy
the given spatial and social constraints. For ease of presentation, we call a user group valid
if it satisfies both Conditions 1 and 2 in Definition 4.2. Next we analyze the complexity
of the GSKCG query problem.
Theorem 4.1. The GSKCG query problem is NP-hard.
Proof. We establish the hardness by a reduction from a classical NP-complete problem,
namely the minimum set cover (MSC) problem. An instance of the MSC problem consists
of a universe U = {e1, e2, . . . , en} and a set of sets S = {S1, S2, . . . , Sm}, where Si ⊆ U .
The problem is to find a subset S ′ of S such that all the elements
in U are covered by S ′ and the size of S ′ is minimum.
Given an instance of MSC, we construct an instance of a GSKCG query Q = (k, P )
on a set of users. Each element ei in U corresponds to a spatial query point in P , each set
Si corresponds to a user ui’s associated region ui.R, and the elements in Si correspond
to the spatial points in ui’s associated region ui.R. We consider the restricted case of
GSKCG query when k = 0. It can be seen that there exists a solution to the MSC problem
if and only if there exists a solution to Q (i.e., find a minimum set of users such that all
given query points are fully covered by their associated regions).
Suppose we have a polynomial-time algorithm A that returns the query answer G′ =
{u′1, u′2, . . . , u′m} to a GSKCG query Q. If P is fully covered by the associated regions
of G′, then {S ′1, S ′2, . . . , S ′m} fully covers U and its size m is minimum. This implies
that a polynomial-time solution to the MSC problem is found, which is impossible unless P = NP.
Therefore, there does not exist a polynomial-time algorithm A for the GSKCG query
problem unless P = NP.
In this chapter, we study how to efficiently process GSKCG queries. We aim for an
optimal solution that has short response time. This is mainly achieved by a set of effective
pruning strategies (see Section 4.2) and a novel index structure (see Section 4.3).
4.2 Algorithm Design
In this section, we present our KCGFinder algorithm and a set of pruning strategies for
answering GSKCG queries.
4.2.1 Basic Algorithm
To satisfy the minimum cardinality requirement of a GSKCG query, the general idea of
KCGFinder is to process the user groups in increasing order of group size and return the
current group as soon as it is valid.
Algorithm 6 gives the pseudo code of the KCGFinder algorithm. Before performing
a search on the input LBSN G = (V,E), we first conduct two filtering operations: spatial
filtering and social filtering. In spatial filtering, we use an R-tree to get the users whose
associated regions cover at least one query point p ∈ P (Line 1, Algorithm 6). In social
filtering, we adopt the core decomposition algorithm [8] to identify the set Uk of
users in S that may appear in a k-core, and invoke a depth-first search
(DFS) to find the set H of connected components of G[Uk] that each fully cover P (Lines
2–3, Algorithm 6).
In Line 4 of Algorithm 6, we compute the maximum cardinality M of the components
in H , which gives the upper bound of the size of the returned user group. By definition,
the cardinality of a k-core is at least k + 1. Thus, we enumerate user groups in increasing
order of size from k + 1 to M . Given a size s, for each component C^k with size ≥ s,
we invoke the GetOptimalGroup function (see Algorithm 7) to find a size-s user group
Algorithm 6 KCGFinder(Query points P , Integer k, LBSN G)
1: S ← the set of users in G that each cover at least one point in P ;
2: Uk ← the set of users belonging to S that may appear in a k-core;
3: H ← all connected components of G[Uk] that each fully cover P ;
4: M ← max_{C^k ∈ H} |C^k|;
5: for s from k + 1 to M do
6:     for each C^k in H do
7:         if |C^k| ≥ s then
8:             C^k_s ← GetOptimalGroup(C^k, k, s, P );
9:             if C^k_s ≠ ∅ then
10:                Return C^k_s;
11: Return ∅;

Algorithm 7 GetOptimalGroup(Component G, Integer k, Integer s, Query points P )
1: for each size-s user group Cs of G do
2:     if the number of edges of G[Cs] ≥ k(k + 1)/2 then
3:         if G[Cs] is a k-core and P ⊆ ⋃_{u ∈ Cs} u.R then
4:             Return Cs;
5: Return ∅;
C^k_s whose joint regions fully cover P (for short, we say “C^k_s covers P ”) and which is a
k-core. If C^k_s is not empty, it is returned as the final optimal answer to the GSKCG query.
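The following sketch walks through Algorithms 6 and 7 without any pruning. It is an illustrative reading, not the thesis implementation: the R-tree lookup of Line 1 is replaced by a linear scan, Line 2 is interpreted as peeling low-degree users inside G[S], and the edge list is an assumed stand-in for Figure 4.1(a).

```python
from itertools import combinations

COVER = {"u1": {"p1"}, "u2": {"p2"}, "u3": {"p2", "p3"},
         "u4": {"p4"}, "u5": {"p3"}, "u6": {"p3"}}       # from Example 4.6
EDGES = [("u1", "u2"), ("u1", "u3"), ("u1", "u4"),
         ("u3", "u4"), ("u2", "u5"), ("u5", "u6")]       # assumed edges
ADJ = {u: set() for u in COVER}
for a, b in EDGES:
    ADJ[a].add(b)
    ADJ[b].add(a)

def peel(adj, users, k):
    """Repeatedly drop users with fewer than k neighbors among the survivors."""
    alive = set(users)
    changed = True
    while changed:
        changed = False
        for u in list(alive):
            if len(adj[u] & alive) < k:
                alive.discard(u)
                changed = True
    return alive

def components(adj, users):
    """Connected components of the subgraph induced by `users` (iterative DFS)."""
    todo, comps = set(users), []
    while todo:
        stack = [todo.pop()]
        comp = set(stack)
        while stack:
            u = stack.pop()
            for v in adj[u] & todo:
                todo.discard(v)
                comp.add(v)
                stack.append(v)
        comps.append(comp)
    return comps

def covers(cover, group, points):
    return points <= set().union(*(cover[u] for u in group))

def get_optimal_group(adj, cover, comp, k, s, points):
    """Algorithm 7 by plain enumeration."""
    for group in map(set, combinations(sorted(comp), s)):
        edges = sum(len(adj[u] & group) for u in group) // 2
        if edges < k * (k + 1) // 2:                     # Line 2: too few edges
            continue
        if (all(len(adj[u] & group) >= k for u in group)
                and len(components(adj, group)) == 1
                and covers(cover, group, points)):       # Line 3
            return group
    return None

def kcg_finder(adj, cover, k, points):
    s_users = {u for u in adj if cover[u] & points}                    # Line 1
    u_k = peel(adj, s_users, k)                                        # Line 2
    h = [c for c in components(adj, u_k) if covers(cover, c, points)]  # Line 3
    m = max((len(c) for c in h), default=0)                            # Line 4
    for s in range(max(k + 1, 1), m + 1):                              # Lines 5-10
        for comp in h:
            if len(comp) >= s:
                found = get_optimal_group(adj, cover, comp, k, s, points)
                if found:
                    return found
    return None                                                        # Line 11
```

With k = 2 and P = {p1, p2, p3, p4}, peeling leaves only u1, u3 and u4 under the assumed edges, and the first size tried (s = 3) already returns that group.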
It can be observed that the main complexity of KCGFinder comes from the GetOpti-
malGroup function. Therefore, in the rest of this section, we focus on how to optimize
GetOptimalGroup via a set of pruning techniques. We give the general idea of GetOp-
timalGroup in Algorithm 7. GetOptimalGroup enumerates all size-s user groups and
checks whether they are valid. Since a k-core has at least k + 1 vertices, each with degree
at least k, we can prune out a user group
Cs if the number of edges in G[Cs] is less than k(k + 1)/2 (Line 2, Algorithm 7).
For a systematic enumeration of all candidate user groups, we employ the branch and
bound algorithm [36]. In the branch and bound search process, we keep track of two user
sets SI and SU, which represent the intermediate solution set and the set of remaining
users, respectively. Initially, SI is empty, and SU is the set of all users in component G.
We iteratively add users from SU to SI to check whether the resultant group is valid. This
process can be organized into a tree structure, as illustrated in Figure 4.2, in which an
internal node represents an SI and a leaf node represents a size-s candidate group. In the
rest of this section, we explore a set of effective pruning strategies to speed up the branch
and bound search.
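The expand-and-backtrack process over SI and SU can be sketched as a small recursive search. This is a generic skeleton under assumed names: `is_valid` stands for the validity test of a size-s group, and `can_extend` is a hypothetical hook where the pruning rules of the following subsections would be plugged in.

```python
def branch_and_bound(candidates, s, is_valid, can_extend=lambda si, su: True):
    """Depth-first search over the tree of intermediate solutions.

    si: the intermediate solution (a list of users)
    su: the remaining users that may still be added to si
    """
    def search(si, su):
        if len(si) == s:                      # leaf: a size-s candidate group
            return list(si) if is_valid(si) else None
        if not can_extend(si, su):            # prune the subtree rooted at si
            return None
        for i, u in enumerate(su):
            found = search(si + [u], su[i + 1:])   # expanding
            if found is not None:
                return found
            # Falling through to the next iteration is the backtracking step.
        return None

    return search([], list(candidates))
```

Passing only the users after position i to the recursive call ensures that each unordered group is enumerated exactly once.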
Figure 4.2: Branch and bound search tree
4.2.2 Basic Pruning
We start with two basic pruning strategies, k-core (KC) based pruning and spatial query-
point coverage (SQPC) based pruning, based on the degree constraint in a k-core and the
spatial query point coverage constraint, respectively.
KC based pruning
By the definition of k-core, we know that the minimum degree of each vertex in a k-core
should be no less than k. Therefore, in the branch and bound search, if the minimum
degree constraint cannot be satisfied after adding any new users from SU to SI, the search
process should backtrack to the previous state of SI (that is, the parent node of the node
representing SI in the search tree). We give the critical condition under which the current
SI may form a valid group below.
Theorem 4.2. Let u_min ∈ SI be the user with the minimum number of neighbors in SI. If SI
is contained in a valid group of size s, then |NB_SI(u_min)| + s − |SI| ≥ k, where |NB_SI(u_min)|
is the number of u_min’s neighbors in SI.
Proof. Since we can add only s − |SI| users from SU to SI, the degree of u_min in any
valid group with size s is at most |NB_SI(u_min)| + s − |SI| (attained when all added users are
neighbors of u_min). By Definition 4.1, to form a valid group, the degree of u_min in the
group should be ≥ k. This establishes the theorem.
Theorem 4.2 implies that if the current SI cannot satisfy this condition, the entire
subtree rooted at the node representing SI can be skipped.
Example 4.2. Consider the LBSN in Figure 4.1. Let SI = {u2, u4}, SU = {u1, u3}, s = 3
and k = 2. Since u2 has the minimum number of neighbors in SI and |NB_SI(u2)| = 0,
we have |NB_SI(u2)| + s − |SI| = 0 + 3 − 2 = 1 < 2. The condition in Theorem 4.2 does not
hold, and therefore we can stop searching the users in SU.
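The condition of Theorem 4.2 reduces to a single comparison. The sketch below uses an assumed edge list in place of Figure 4.1(a); the only property it relies on for Example 4.2 is that u2 and u4 are not adjacent, as the example states.

```python
EDGES = [("u1", "u2"), ("u1", "u3"), ("u1", "u4"),
         ("u3", "u4"), ("u2", "u5"), ("u5", "u6")]   # assumed edges
ADJ = {}
for a, b in EDGES:
    ADJ.setdefault(a, set()).add(b)
    ADJ.setdefault(b, set()).add(a)

def kc_prunable(adj, si, s, k):
    """True if SI violates Theorem 4.2, so its subtree can be skipped."""
    si = set(si)
    if not si:
        return False
    # Fewest neighbors inside SI over all members of SI.
    min_neighbors = min(len(adj[u] & si) for u in si)
    # Even if every added user were a neighbor, degree k stays out of reach.
    return min_neighbors + (s - len(si)) < k
```

For SI = {u2, u4}, s = 3 and k = 2 the function reports that the subtree can be skipped, matching Example 4.2.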
SQPC based pruning
Any valid user group should cover all query points P . If SU cannot fully cover the remaining
query points in P − P_SI , where P_SI is the set of points covered by SI, adding any user
from SU to SI cannot form a valid group. In this case, the search process can safely prune
the subtree rooted at SI without missing the optimal solution.
In some cases, even though SU can cover all query points in P − P_SI , the users of SU
still cannot extend SI to a valid group of size s. Theorem 4.3 is given to capture such cases.
Theorem 4.3. Let u_max ∈ SU be the user whose region covers the most query points in
P − P_SI . To form a valid group with size s, SU should satisfy:
\[ \frac{|P - P_{SI}|}{s - |SI|} \le |P_{u_{max}}| \tag{4.2.1} \]
where |P_umax| is the number of query points in P − P_SI covered by u_max.
Proof. |P − P_SI | is the number of query points not covered by SI , and s − |SI| is the
number of users to be added from SU to SI . On average, each user to be added should
cover at least |P − P_SI | / (s − |SI|) query points. Therefore, the number of points in P − P_SI covered
by u_max must be greater than or equal to the average.
Intuitively, given the number of query points, the size of the user group and SI , Equation 4.2.1
gives the lower bound of the maximum number of query points in P − P_SI that
a user in SU should cover.
Example 4.3. Consider the LBSN in Figure 4.1. Suppose SI = {u2}, SU = {u1, u3, u4, u5, u6},
k = 2 and s = 3. We can compute |P − P_SI | / (s − |SI|) = |{p1, p3, p4}| / (3 − 1) = 3/2
and |P_umax| = 1. Since Equation 4.2.1 does not hold, there is no need to search users in SU .
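Both SQPC checks (the coverage test and Equation 4.2.1) can be sketched as one function. Coverage sets follow Example 4.6; the function name and the integer form of the inequality (multiplying through by s − |SI| to avoid division) are choices made here for illustration.

```python
COVER = {"u1": {"p1"}, "u2": {"p2"}, "u3": {"p2", "p3"},
         "u4": {"p4"}, "u5": {"p3"}, "u6": {"p3"}}   # from Example 4.6

def sqpc_prunable(cover, si, su, s, points):
    """True if SI cannot be completed from SU into a size-s covering group."""
    covered = set().union(*(cover[u] for u in si))
    remaining = points - covered                     # P - P_SI
    if not remaining:
        return False
    slots = s - len(si)                              # users still to be added
    reachable = set().union(*(cover[u] for u in su))
    if slots <= 0 or not remaining <= reachable:
        return True                                  # SU cannot cover the rest
    # |P_umax|: best single-user coverage of the remaining points.
    best = max(len(cover[u] & remaining) for u in su)
    # Equation 4.2.1 fails when |P - P_SI| / (s - |SI|) > |P_umax|.
    return len(remaining) > best * slots
```

With SI = {u2}, SU = {u1, u3, u4, u5, u6} and s = 3, three points remain, two slots are left, and no remaining user covers more than one of them, so the subtree is pruned as in Example 4.3.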
4.2.3 Diameter Based Pruning
In this section, we propose pruning techniques based on the concept of social diameter.
We first give the definition of the diameter of a group C^k_s with size s.
Definition 4.3. (Diameter) The diameter of a user group C^k_s in an LBSN G is defined as
the longest shortest path length between any two users in G[C^k_s], denoted by DIA(C^k_s).
Let DIAub(C^k_s) denote the upper bound of DIA(C^k_s). It is easy to derive that DIAub(C^k_s) =
s − k. However, this bound is too loose when s is large. In this chapter, we make use of the
stricter bound proposed in [55].
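Theorem 4.4. For a user group C^k_s,
\[
DIA_{ub}(C^k_s) =
\begin{cases}
1 & \text{if } s = k + 1 \\
2 & \text{if } k + 1 < s < 2k + 2 \\
3\left\lfloor \frac{s}{k+1} \right\rfloor + r(s, k) - 3 & \text{if } s \ge 2k + 2
\end{cases}
\tag{4.2.2}
\]
where
\[
r(s, k) =
\begin{cases}
0 & \text{if } \mathrm{mod}(s, k + 1) = 0 \\
1 & \text{if } \mathrm{mod}(s, k + 1) = 1 \\
2 & \text{if } \mathrm{mod}(s, k + 1) = 2
\end{cases}
\]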
This diameter upper bound of C^k_s introduces a way to measure whether two users can
co-exist in C^k_s. Next we present two pruning techniques called social shortest path (SHP)
based pruning and spatial-social shortest path (SOSP) based pruning.
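The bound of Theorem 4.4 is cheap to evaluate. The sketch below rests on two assumptions that the extracted text does not settle: the bracket in Equation 4.2.2 is read as the floor function, and remainders larger than 2, which the theorem as printed does not list, are treated like a remainder of 2.

```python
def dia_ub(s, k):
    """Upper bound on the diameter of a size-s k-core (Equation 4.2.2)."""
    if s < k + 1:
        raise ValueError("a k-core has at least k + 1 members")
    if s == k + 1:
        return 1
    if s < 2 * k + 2:
        return 2
    # Assumption: remainders above 2 contribute the same term as 2.
    r = min(s % (k + 1), 2)
    # Assumption: the bracket in the source denotes the floor function.
    return 3 * (s // (k + 1)) + r - 3
```

For k = 1 and s = 3 this gives 2, the value used in Examples 4.4 and 4.5. For k = 1 the third case gives s − 1, which coincides with the simple bound s − k.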
SHP based pruning
The SHP based pruning is inspired by the observation that, if the shortest path length between
two users exceeds DIAub(C^k_s), they cannot appear simultaneously in C^k_s. It follows
that a user v ∈ SU can be added into SI only when the shortest path length between v
and every u ∈ SI satisfies the condition presented in Theorem 4.5.
Theorem 4.5. Let v be the user to be added into SI from SU , and Dist(SI, v) be the
maximum shortest path length between the users in SI and v. v can be added into SI
only if the following inequality is satisfied:
\[ Dist(SI, v) \le DIA_{ub}(C^k_s) \tag{4.2.3} \]
Proof. By Definition 4.3, the shortest path length between any two users in a valid group
C^k_s should be ≤ DIA(C^k_s). If v can be added into SI , then Dist(SI, v) ≤ DIA(C^k_s) must
be true. Therefore, Dist(SI, v) ≤ DIAub(C^k_s) is also true.
Example 4.4. Consider the LBSN in Figure 4.1. Suppose k = 1 and s = 3. Let SI =
{u4} and SU = {u5, u6}. By Theorem 4.4, we can compute, for any valid group C^k_s,
DIAub(C^k_s) = 2. The shortest path lengths between u4 and u5, and between u4 and u6, are 3 and 4,
respectively. Thus, no user in SU can be added into SI to form a valid group.
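A minimal sketch of the SHP test follows. Shortest path lengths are computed by breadth-first search over an assumed edge list that reproduces the distances quoted in Example 4.4 (3 between u4 and u5, 4 between u4 and u6); the thesis itself answers distance queries from an index rather than by BFS.

```python
from collections import deque

EDGES = [("u1", "u2"), ("u1", "u3"), ("u1", "u4"),
         ("u3", "u4"), ("u2", "u5"), ("u5", "u6")]   # assumed edges
ADJ = {}
for a, b in EDGES:
    ADJ.setdefault(a, set()).add(b)
    ADJ.setdefault(b, set()).add(a)

def bfs_distances(adj, src):
    """Hop distance from src to every reachable user."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def shp_admissible(adj, si, v, bound):
    """Equation 4.2.3: v may join SI only if Dist(SI, v) <= bound."""
    dist = bfs_distances(adj, v)
    return all(dist.get(u, float("inf")) <= bound for u in si)
```

With SI = {u4} and a bound of 2, neither u5 nor u6 is admissible, while u1 is.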
SOSP based pruning
SHP based pruning can quickly verify whether there exists a user in SU to form a valid
group with the current SI . To further reduce the search space, we present SOSP based
pruning, which considers not only the shortest path length between two users but also the
users’ covered query points.
Intuitively, for any valid user group C^k_s, if a user u and all other users in the circle
centered at u with diameter DIA(C^k_s) cannot fully cover all query points P , u cannot be
a member of C^k_s. This implies that, in this case, for the given specific values of k and
s, u could be removed from the search space without missing the optimal solution. This
provides extra pruning capabilities on top of SHP based pruning. Let NB^p_u be the user
that covers a query point p ∈ P and has the minimum shortest path length to user u in the
LBSN. We formally capture this intuition in Theorem 4.6.
Theorem 4.6. Let Dist(u, NB^p_u) be the shortest path length between u and NB^p_u. For
any valid group C^k_s, if Dist(u, NB^p_u) > DIAub(C^k_s) for some p ∈ P with p ∉ u.R, then u
cannot be a user of C^k_s.
Proof. Suppose, to the contrary, that v is a user of C^k_s and there exists a query point p ∈ P
with p ∉ v.R satisfying Dist(v, NB^p_v) > DIAub(C^k_s). Since C^k_s is a valid group, by
Definition 4.3 we have Dist(v, u) ≤ DIA(C^k_s) for every u ∈ C^k_s. Since p must be covered
by C^k_s, some user w ∈ C^k_s covers p. By the definition of NB^p_v, we have
Dist(v, NB^p_v) ≤ Dist(v, w) ≤ DIA(C^k_s) ≤ DIAub(C^k_s), leading to a contradiction. This
completes the proof.
Figure 4.3: Sorted user list ListP (users in the order u1, u4, u2, u3, u5, u6, with the access index of each query point marked)
Below we provide an example to illustrate how SOSP based pruning works.
Example 4.5. Consider constructing a valid group C^k_s with k = 1 and s = 3 for the LBSN
in Figure 4.1. The user set of this LBSN is {u1, u2, u3, u4, u5, u6}. From Theorem 4.4, we
get DIAub(C^k_s) = 2. Since NB^p4_u5 and NB^p4_u6 are both u4, we have Dist(u5, u4) = 3 and
Dist(u6, u4) = 4. From Theorem 4.6, we learn that both u5 and u6 should be removed
from the search space of finding C^k_s. Now we get a smaller search space {u1, u2, u3, u4}.
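Theorem 4.6 can be applied as a one-pass filter over the user set, sketched below. Coverage sets follow Example 4.6, and the edge list is again an assumption that reproduces the distances quoted in Example 4.5; distances are computed by BFS for simplicity.

```python
from collections import deque

COVER = {"u1": {"p1"}, "u2": {"p2"}, "u3": {"p2", "p3"},
         "u4": {"p4"}, "u5": {"p3"}, "u6": {"p3"}}   # from Example 4.6
EDGES = [("u1", "u2"), ("u1", "u3"), ("u1", "u4"),
         ("u3", "u4"), ("u2", "u5"), ("u5", "u6")]   # assumed edges
ADJ = {u: set() for u in COVER}
for a, b in EDGES:
    ADJ[a].add(b)
    ADJ[b].add(a)

def bfs_distances(adj, src):
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def sosp_filter(adj, cover, users, points, bound):
    """Keep only users whose nearest coverer of every uncovered point is within bound."""
    kept = set()
    for u in users:
        dist = bfs_distances(adj, u)
        ok = True
        for p in points - cover[u]:
            # Dist(u, NB^p_u): distance to the closest user covering p.
            nearest = min((dist[w] for w in dist if p in cover[w]),
                          default=float("inf"))
            if nearest > bound:
                ok = False
                break
        if ok:
            kept.add(u)
    return kept
```

With a bound of 2 and P = {p1, p2, p3, p4}, the filter removes u5 and u6 and keeps {u1, u2, u3, u4}, as in Example 4.5.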
For diameter based pruning techniques, we need to compute the shortest path length
between any pair of users. However, it is impractical to calculate the lengths on the fly,
because doing so would substantially increase the total running time of our algorithm. A possible
method is to pre-compute all the lengths offline and then index them for online query
processing. However, this approach needs O(n^2) storage, where n is the number of users
in the LBSN, which is not feasible when n is large. In this chapter,
we adopt the tree-structured index constructed based on the concept of vertex cover (VC-index)
[14], which can efficiently process distance queries between users with a small
storage cost. We also employ a caching technique to accelerate shortest path length
lookups. Given two users, we first look up the shortest path length between
them in the cache. If the length is cached, we read it directly. Otherwise, the length is
calculated from the VC-index. For cache replacement, we adopt the least
recently used (LRU) policy.
4.2.4 Access Order Based Pruning
In this section, we propose the last pruning strategy based on the observation that more
search space can be pruned if the users are accessed in a certain order in the branch and
bound search. Given a set of users V and a set of query points P , we place the users in V
into several sets V^P = {V_p1, V_p2, · · · , V_p|P|}, where V_pi is the set of users whose associated
region covers the point pi ∈ P . Note that a user u may belong to multiple V_pi, because
u’s associated region may cover more than one query point. We first sort V^P in increasing
order of V_pi’s size, and then sequentially access each V_pi and push all users in V_pi into a user
list ListP . If a user has already been pushed into ListP , he/she is skipped in later operations.
Thereafter, the search process adds users from SU to SI according to their indexes in
ListP . We give an example of constructing ListP .
Example 4.6. Consider the users and query points in Figure 4.1(b). We can place the
users into four sets, Vp1 = {u1}, Vp2 = {u2, u3}, Vp3 = {u3, u5, u6}, Vp4 = {u4}. To
construct the sorted user list ListP , we first add u1 (in Vp1) to ListP , then u4, u2, u3 in
order. After that, since u3 has been added when processing Vp2 , he/she will be skipped
when accessing Vp3 . Finally, u5 and u6 are added. The constructed ListP is given in
Figure 4.3 (ignore the arrows for the moment).
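The construction of ListP can be sketched in a few lines. Coverage sets follow Example 4.6; breaking ties between equally sized sets by point name, and ordering users within a set by user name, are assumptions made here so that the output is deterministic.

```python
COVER = {"u1": {"p1"}, "u2": {"p2"}, "u3": {"p2", "p3"},
         "u4": {"p4"}, "u5": {"p3"}, "u6": {"p3"}}   # from Example 4.6

def build_list_p(cover, points):
    # V_p: users whose associated region covers point p.
    v_p = {p: [u for u in sorted(cover) if p in cover[u]] for p in sorted(points)}
    # Visit the sets in increasing order of size (the sort is stable).
    order = sorted(v_p, key=lambda p: len(v_p[p]))
    list_p = []
    for p in order:
        for u in v_p[p]:
            if u not in list_p:          # skip users pushed earlier
                list_p.append(u)
    return list_p
```

For the four query points of Figure 4.1(b) this produces the order u1, u4, u2, u3, u5, u6 described in Example 4.6.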
Next we discuss how to make use of ListP to gain additional pruning capability. For
a query point p, we define its access index in ListP as follows.
Definition 4.4. (Access index) The access index of a query point p ∈ P in a sorted user
list ListP , denoted by A(p), is the index of the last user whose associated region covers
p.
The access indexes are illustrated in Figure 4.3. Suppose p is a query point in P
that has not been covered by SI . If the smallest index of users in SU is greater than
A(p), no remaining user can cover p, so the search process should backtrack to the parent
node of SI . For a GSKCG query,
we maintain the access index for each query point. Note that, for a GSKCG query, the
access indexes and ListP just need to be calculated once and do not need to be updated.
Therefore, they can be constructed efficiently.
Figure 4.4: A sample SaR-tree
Example 4.7. Continue with Example 4.6. We have A(p1) = 1, A(p2) = 4, A(p3) = 6
and A(p4) = 2. Suppose SI = {u1}, SU = {u2, u3, u5, u6}, and P − P_SI = {p2, p3, p4}.
According to the access order in ListP , u2 should be the first to be added to SI . We
compare u2’s index in ListP , 3, with A(p2), A(p3) and A(p4). Since A(p4) = 2 < 3, no
user in SU can be added to SI . The search process backtracks to SI’s previous state.
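The access index of Definition 4.4 and the resulting backtracking test can be sketched as follows. The list and coverage sets are those of Examples 4.6 and 4.7; indexes are 1-based as in the text, and giving a point that no user covers the index 0 is a convention adopted only for this sketch.

```python
COVER = {"u1": {"p1"}, "u2": {"p2"}, "u3": {"p2", "p3"},
         "u4": {"p4"}, "u5": {"p3"}, "u6": {"p3"}}    # from Example 4.6
LIST_P = ["u1", "u4", "u2", "u3", "u5", "u6"]         # from Example 4.6

def access_indexes(list_p, cover, points):
    """A(p): 1-based index of the last user in ListP whose region covers p."""
    return {p: max((i for i, u in enumerate(list_p, start=1) if p in cover[u]),
                   default=0)
            for p in points}

def access_prunable(list_p, cover, si, su, points):
    """True if some uncovered point lies entirely behind the users left in SU."""
    uncovered = points - set().union(*(cover[u] for u in si))
    if not su:
        return bool(uncovered)
    position = {u: i for i, u in enumerate(list_p, start=1)}
    smallest = min(position[u] for u in su)
    a = access_indexes(list_p, cover, points)
    return any(a[p] < smallest for p in uncovered)
```

For SI = {u1} and SU = {u2, u3, u5, u6}, the smallest remaining index is 3 while A(p4) = 2, so the search backtracks as in Example 4.7.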
4.3 Hybrid Indexing
In this section, we design a novel index structure, the Enhanced Social-aware R-tree (SaR-
tree), to further accelerate query processing.
4.3.1 SaR-tree
The SaR-tree structure [74] is a variant of R-tree that indexes both spatial locations and
social relations. Figure 4.4 illustrates a simple SaR-tree. Different from a classical R-
tree, each entry of an SaR-tree contains two major pieces of information: a set of core
bounding rectangles (CBRs) (see Definition 4.5) that encodes the social information and
a minimum bounding rectangle (MBR) that encodes the spatial information as in an R-tree.
Intuitively, a CBR bounds the users by the social constraint while an MBR bounds
the users by the spatial constraint, and therefore an SaR-tree gains the ability of both
social-based and spatial-based pruning for GSKCG query processing.
Figure 4.5: Example of CBRs in an SaR-tree
Definition 4.5. (Core bounding rectangle) Consider a user u ∈ G. Given a minimum
degree constraint k, the core bounding rectangle CBRu,k is a rectangle that contains
u and inside which any user group with u (excluding the users on the bounding edges)
cannot be a k-core.
Note that, for given u and k, CBRu,k may not be unique. We illustrate the idea of
CBR in the following example.
Example 4.8. Consider the LBSN in Figure 4.5. Given k = 2, the rectangle r1 is a
CBRu2,2 because any user group inside r1 that contains u2 is not a 2-core. Similarly, r3
is another CBRu2,2 for u2. In contrast, r2 is not a CBRu2,2 because {u1, u2, u5} in r2
form a 2-core.
In addition to CBRs and an MBR, each entry in an SaR-tree also contains a core
number. A user u’s core number is the maximum k for which u belongs to a k-core,
denoted by cn(u). The core number of an entry e is defined as the maximum of the core
numbers of the users covered by e, denoted by cn(e).
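Core numbers can be obtained by the standard peeling procedure: repeatedly remove a user of minimum remaining degree. The sketch below is a simple quadratic-time version rather than the linear-time algorithm cited earlier, and the edge list is an assumed example graph.

```python
EDGES = [("u1", "u2"), ("u1", "u3"), ("u1", "u4"),
         ("u3", "u4"), ("u2", "u5"), ("u5", "u6")]   # assumed edges
ADJ = {}
for a, b in EDGES:
    ADJ.setdefault(a, set()).add(b)
    ADJ.setdefault(b, set()).add(a)

def core_numbers(adj):
    """cn(u) for every user: the largest k such that u belongs to a k-core."""
    degree = {u: len(adj[u]) for u in adj}
    remaining = set(adj)
    core, k = {}, 0
    while remaining:
        u = min(remaining, key=degree.get)   # user of minimum remaining degree
        k = max(k, degree[u])                # core numbers never decrease
        core[u] = k
        remaining.discard(u)
        for v in adj[u]:
            if v in remaining:
                degree[v] -= 1
    return core

def entry_core_number(core, users):
    """cn(e): the maximum core number among the users covered by entry e."""
    return max(core[u] for u in users)
```

On the assumed graph, the triangle u1, u3, u4 receives core number 2 and the chain u2, u5, u6 receives core number 1.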
4.3.2 Enhanced SaR-tree
Unfortunately, the SaR-tree structure proposed in [74] cannot support GSKCG queries.
The main reason is that the method of computing CBRs in [74] assumes that each user is
associated with a spatial point, whereas in our problem each user has an associated region.
This fact significantly complicates the problem and demands a new method to construct
CBRs.
Figure 4.6: A sample LBSN for constructing CBR. (a) Social networks; (b) Associated regions.
We propose a novel index structure, known as the enhanced SaR-tree, to address this
problem. To construct an Enhanced SaR-tree over an LBSN, we first construct a standard
R-tree rtree and then compute the CBR for each entry in rtree. To compute the CBR of
an entry, we should know how to build a user’s CBR. The general idea of constructing a
user’s CBR includes two steps. First, as the users’ associated regions may intersect with
each other, we calculate the user’s internal CBR (see Definition 4.6). Second, given the
user’s internal CBR, we expand it to obtain the corresponding external CBR (see Defini-
tion 4.7), from which the user’s CBR will be selected. We give the formal definitions of
these two types of CBRs below. For ease of exposition, we denote “a k-core containing a
user u” by “k(u)”.
Definition 4.6. (Internal CBR) Given a k value, a user u’s internal CBR iCBRu,k is a
rectangle that is inside u.R and that does not contain a k(u).
Example 4.9. Consider the LBSN in Figure 4.6. Figure 4.7 shows some iCBRu,2 of user
u, marked by the shaded areas. Figure 4.7(a–b) and Figure 4.7(c–e) show iCBRu,2 of
user u in x-direction and y-direction, respectively. In Figure 4.7(a), the shaded area is an
iCBRu,2 of user u because: 1) it is inside u.R; and 2) the users in this iCBRu,2 (i.e.,
u1, u2, and u) cannot form a 2-core containing u.
Figure 4.7: Constructing user u’s internal CBRs (panels (a)–(e), each showing the sliding lines ℓ1 and ℓ2)
Definition 4.7. (External CBR) Given a user u’s internal CBR iCBRu,k, the corresponding
external CBR eCBRu,k is defined as a rectangle that: 1) contains this iCBRu,k;
2) is inside the MBR of u’s parent in rtree; and 3) does not contain a k(u).
Example 4.10. Continue with Example 4.9. Given a user u’s iCBRu,2 in Figure 4.7(a),
Figure 4.8 shows the corresponding eCBRu,2. The outermost rectangle marks the MBR
of u’s parent in the enhanced SaR-tree.
Figure 4.8: Constructing a user u’s external CBRs
Algorithm 8 describes how to construct a CBR of a user u. We first use an R-tree to
find the users whose associated regions overlap with u.R, and add them into a user set
H (Line 1, Algorithm 8). We then construct a set of iCBRu,k of u from two directions
Algorithm 8 GetUserCBR(User u, Integer k, LBSN G, Enhanced SaR-tree rtree)
1: H ← the users whose associated regions overlap with u.R;
2: X ← left and right edges of the associated region of each user in H;
3: Y ← top and bottom edges of the associated region of each user in H;
4: Sort the elements in X and Y in ascending order;
5: iCBR(X) ← GetInternalCBRs(u, k, X , H , G);
6: iCBR(Y ) ← GetInternalCBRs(u, k, Y , H , G);
7: iCBRs ← iCBR(X) ∪ iCBR(Y );
8: eCBRs ← GetExternalCBRs(u, k, iCBRs, G, rtree);
9: Return the element of eCBRs with the maximum area;

Algorithm 9 GetInternalCBRs(User u, Integer k, Line set X , User set H , LBSN G)
1: LB ← a line on the left edge of u.R;
2: ℓ1 ← LB;
3: ℓ2 ← LB;
4: iCBRs ← ∅;
5: while ℓ1 and ℓ2 do not exceed the right edge of u.R do
6:     OperateBoth(ℓ1, ℓ2, X , G);
7:     OperateL2(ℓ2, X , G);
8:     iCBRs.add(Λ[ℓ1, ℓ2, u.R]);
9:     OperateL1(ℓ1, X , G);
10: Return iCBRs;
(i.e., x-direction and y-direction). We put the left and right (or bottom and top) edges
of the associated regions of the users in H into the line set X (or Y ), respectively, and
sort the lines in X (or Y ) in ascending order to facilitate the construction of
internal CBRs (Line 4, Algorithm 8). Then we use the GetInternalCBRs function to
generate the internal CBRs in both directions. Based on these internal CBRs, we invoke
the GetExternalCBRs function to calculate the corresponding external CBRs. Finally,
the external CBR with the maximum area is returned as u’s CBR. Next we elaborate on
GetInternalCBRs and GetExternalCBRs.
The GetInternalCBRs function. The general idea of constructing iCBRu,k of a user u is
to alternately slide two vertical (or horizontal) lines on u.R in x-direction (or y-direction).
In the end, the area inside the intersection of these two lines and u.R will be u’s iCBRu,k.
Since the construction of iCBRu,k in y-direction is similar to that in x-direction, we only
discuss the case for x-direction.
Given a user u, a value of k, a sorted line set X and a user set H , we primarily perform
three kinds of operations on lines ℓ1 and ℓ2 (i.e., move ℓ1 and ℓ2 simultaneously, move ℓ2
alone, and move ℓ1 alone) to obtain iCBRu,k of u in x-direction. Initially, we place both
ℓ1 and ℓ2 on the left edge of u.R (Lines 1–3, Algorithm 9), and then move ℓ1 and ℓ2
rightward using one of the following operations:
1. OperateBoth: When ℓ1 and ℓ2 overlap with each other, we move them rightward
to the next line in X (but not exceeding the right edge of u.R) such that the users in
H whose associated regions are touched by ℓ1 and ℓ2, denoted by H(ℓ1), do not form
a k(u).
2. OperateL2: We move ℓ2 rightward to the next line in X (not exceeding the right
edge of u.R) such that the users in the rectangle bounded by ℓ1, ℓ2 and u.R, denoted
by Λ[ℓ1, ℓ2, u.R], form a k(u).² Now, Λ[ℓ1, ℓ2, u.R] is an internal CBR of u.
3. OperateL1: We move ℓ1 rightward to the next line in X (not exceeding the right
edge of u.R) such that the users in Λ[ℓ1, ℓ2, u.R] do not form a k(u).³ Note that ℓ1
is always on the left hand side of ℓ2.
We alternate these three types of operations until both ℓ1 and ℓ2 stop at the right edge of
u.R. Finally, GetInternalCBRs returns all internal CBRs of u.
Example 4.11. Consider the LBSN in Figure 4.6. We illustrate how to compute iCBRu,2
of user u in x-direction in Figure 4.7. Initially, lines ℓ1 and ℓ2 are placed on the left edge
of u.R. Since at this time H(ℓ1) = {u1, u2, u} and these users do not form a k(u) with
k = 2, there is no need to move ℓ1 and ℓ2. We then move ℓ2 rightward to the left edge of
u4, and now the users in the rectangle Λ[ℓ1, ℓ2, u.R], {u1, u2, u4, u}, form a k(u). So the
current Λ[ℓ1, ℓ2, u.R] is an internal CBR of u. Next, we move ℓ1 rightward to the right
edge of u1 because now the users in Λ[ℓ1, ℓ2, u.R], {u2, u4, u}, do not contain a k(u). We
continue this process until both ℓ1 and ℓ2 reach the right edge of u.R.
² Once ℓ2 touches the left edge of a user’s associated region, this user is in the rectangle.
³ Once ℓ1 touches the right edge of a user’s associated region, this user is no longer in the rectangle.
The GetExternalCBRs function. Given the set of iCBRu,k returned by Algorithm 9,
Algorithm 10 is designed for constructing user u’s eCBRu,k. The basic idea is to expand
each iCBRu,k in iCBRs to obtain the corresponding eCBRu,k. We expand each edge
of iCBRu,k outward until the users within the expanded rectangle form a k(u). Recall that,
by Definition 4.7, u’s eCBRu,k is inside the MBR of u’s parent in the enhanced SaR-tree rtree.
So we should stop expanding an edge once it reaches the boundary of the MBR.

Algorithm 10 GetExternalCBRs(User u, Integer k, Internal CBRs iCBRs, LBSN G, Enhanced SaR-tree rtree)
1: eCBRs ← ∅;
2: for each internal CBR iCBRu,k in iCBRs do
3:     eCBRu,k ← expand each edge of iCBRu,k until a k(u) appears or the edge reaches the MBR boundary of u’s parent in rtree;
4:     eCBRs.add(eCBRu,k);
5: Return eCBRs;
Example 4.12. Continue with Example 4.11. Given iCBRu,2 of user u shown in Figure 4.7(a),
we show an example of constructing the corresponding eCBRu,2 in Figure 4.8.
Assume that the outermost rectangle is the MBR of u’s parent. We sequentially move each
of the four edges of iCBRu,2 outward until we obtain the shaded area.
Finally, we discuss how to compute the CBR of each entry in the enhanced SaR-tree
by a bottom-up approach. A leaf entry's CBR is the CBR of the user it represents. For
an internal entry e, let its child entries be $e_1, e_2, \cdots, e_m$. Given the minimum degree
constraint k, e's CBR $CBR_{e,k}$ can be computed by recursively applying the following
function on its child entries' CBRs $CBR_{e_i,k}$:

\[
CBR_{e_1^{i+1},k} =
\begin{cases}
CBR_{e_1^{i},k}, & \text{if } MBR_{e_{i+1}} \cap CBR_{e_1^{i},k} = \emptyset \\
CBR_{e_1^{i},k} \cap CBR_{e_{i+1},k}, & \text{otherwise}
\end{cases}
\tag{4.3.4}
\]

where $CBR_{e_j^{i},k}$ denotes the CBR constructed from $CBR_{e_j,k}, CBR_{e_{j+1},k}, \cdots, CBR_{e_i,k}$.
Therefore, $CBR_{e,k} = CBR_{e_1^{m},k}$. It is easy to verify that, by this construction, any user
group within $CBR_{e,k}$ cannot be a k-core, giving extra pruning capabilities.
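The recursion in Equation (4.3.4) can be sketched as a left-to-right fold over the child entries. The following is a minimal illustration only: it assumes axis-aligned rectangles stored as `(xmin, ymin, xmax, ymax)` tuples, a single CBR per entry for a fixed k, and `None` for an empty rectangle. The names `intersect` and `fold_child_cbrs` are hypothetical and do not come from the thesis.

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax); assumed layout


def intersect(a: Optional[Rect], b: Optional[Rect]) -> Optional[Rect]:
    """Return the overlap of two rectangles, or None if they are disjoint."""
    if a is None or b is None:
        return None
    xmin, ymin = max(a[0], b[0]), max(a[1], b[1])
    xmax, ymax = min(a[2], b[2]), min(a[3], b[3])
    if xmin > xmax or ymin > ymax:
        return None
    return (xmin, ymin, xmax, ymax)


def fold_child_cbrs(children: List[Tuple[Rect, Optional[Rect]]]) -> Optional[Rect]:
    """Apply Equation (4.3.4) to (MBR, CBR) pairs of e_1..e_m for a fixed k."""
    cbr = children[0][1]                      # CBR built from e_1 alone
    for mbr_next, cbr_next in children[1:]:
        if intersect(mbr_next, cbr) is None:  # MBR of e_{i+1} misses the current CBR
            continue                          # first case: CBR unchanged
        cbr = intersect(cbr, cbr_next)        # second case: shrink to the overlap
    return cbr
```

A child whose MBR does not touch the running CBR cannot place any user inside it, which is why the fold leaves the CBR unchanged in that case.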
In practice, k usually does not have to be large. Setting k to 1, 2, or 3 normally
suffices for ordinary social constraints. Thus, when k is small, we can build indexes for
each possible k value with reasonable space and time. For completeness, we also discuss
the case when k is large. Here we can select a set of k values for which to build the indexes
by exploiting a property of the k-core, namely that every k-core is contained in a (k-1)-core.
With this property, we only need to build indexes for $k = 2^0, 2^1, \cdots, 2^{\lfloor \log_2 cn(e) \rfloor}$, where
$cn(e)$ is the maximum core number among the entries rooted at entry e. Given a GSKCG
query $Q = (k', P)$, we can use the CBRs of $k = 2^{\lfloor \log_2 k' \rfloor}$ (the largest indexed value
not exceeding $k'$) for GSKCG query processing. This method may admit false positive users
(users whose core number is less than $k'$), which may in turn enlarge the search space and
increase the computation cost in the later steps. However, it does not compromise the
correctness of the query results, and the space cost of the indexes and the efficiency of
query processing normally remain reasonable.
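As a small illustration of the power-of-two scheme above, the indexed k values and the index chosen for a query can be computed with integer bit operations. This is a sketch under the stated scheme only; the function names `indexed_k_values` and `index_k_for_query` are hypothetical.

```python
from typing import List


def indexed_k_values(cn: int) -> List[int]:
    """k values 2^0, 2^1, ..., 2^floor(log2(cn)) indexed for core number cn >= 1."""
    return [1 << i for i in range(cn.bit_length())]


def index_k_for_query(k_query: int) -> int:
    """Largest indexed k that does not exceed the query's k', i.e. 2^floor(log2(k'))."""
    return 1 << (k_query.bit_length() - 1)
```

For instance, a query with k' = 5 would be answered with the CBRs built for k = 4, so users whose core number is exactly 4 survive the filter and must be removed in the later steps.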
We also give a space complexity analysis for our proposed Enhanced SaR-tree. For an
entry e (or user u), we only store the CBRs of e (or u) for the core numbers
$2^0, 2^1, \cdots, 2^{\lfloor \log_2 cn(e) \rfloor}$. Let M denote the maximum core number of the users in G,
s be the fanout of our index, and n be the number of users in an LBSN G. The upper bound
of the total number of CBRs (denoted by $N_{cbr}$) in an Enhanced SaR-tree can be computed as:

\[
\begin{aligned}
N_{cbr} &\le \frac{2n(\lfloor \log_2 M \rfloor + 1)}{s} + \sum_{u \in V} \lfloor \log_2 cn(u) \rfloor \\
&\le n\left(\frac{2(\lfloor \log_2 M \rfloor + 1)}{s} + 1 + \left\lfloor \log_2 \frac{\sum_{u \in V} cn(u)}{n} \right\rfloor\right)
\end{aligned}
\tag{4.3.5}
\]

Since M and $\frac{\sum_{u \in V} cn(u)}{n}$ are usually small in a social network, the space cost of CBRs
is comparable to that of G. For the datasets used in our experiments, the maximum values of M
and $\frac{\sum_{u \in V} cn(u)}{n}$ over both datasets are 52 and 3.8, respectively. We set the fanout s of our
indexes to 100. Hence, we can calculate the maximum number of CBRs of our Enhanced
SaR-tree on both datasets, which is around 2.12n.
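The 2.12n figure follows directly from the second line of Equation (4.3.5). The short check below reproduces the arithmetic with the values quoted in the text (M = 52, average core number 3.8, fanout s = 100); the function name `cbrs_per_user_bound` is a hypothetical helper.

```python
import math


def cbrs_per_user_bound(max_core: int, avg_core: float, fanout: int) -> float:
    """Per-user factor of the bound in Equation (4.3.5), i.e. N_cbr / n."""
    tree_term = 2 * (math.floor(math.log2(max_core)) + 1) / fanout
    user_term = 1 + math.floor(math.log2(avg_core))
    return tree_term + user_term


bound = cbrs_per_user_bound(max_core=52, avg_core=3.8, fanout=100)
```

Here the tree term contributes 2 · (5 + 1) / 100 = 0.12 and the user term contributes 1 + 1 = 2.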
4.3.3 GSKCG Query Processing
In this section, we present our integrated algorithm SaRBasedKCGFinder. Generally, the
algorithm consists of two steps: 1) filter impossible users based on the enhanced SaR-tree;
2) feed the remaining users to KCGFinder. We give the details of SaRBasedKCGFinder
in Algorithm 11.
Algorithm 11 SaRBasedKCGFinder(Query points P, Integer k, Enhanced SaR-tree rtree, LBSN G)
1: MBR(P) ← The minimum rectangle containing all points in P;
2: Initialize H with the root of rtree;
3: while H has non-leaf entries do
4:     e ← The first non-leaf entry in H;
5:     for each child entry e′ of e do
6:         if MBR(P) ∩ e′.MBR ≠ ∅ and cn(e′) ≥ k and MBR(P) ⊄ CBR_{e′,k} then
7:             H.push(e′);
8: V_H ← The set of users represented by the entries in H;
9: Return KCGFinder(P, k, G[V_H]);
We first calculate the minimum rectangle containing all query points P (i.e., the
coverage of P), denoted by MBR(P). We then iteratively prune impossible users in the LBSN
G by traversing the enhanced SaR-tree rtree. Note that, for the same LBSN G, rtree
needs to be constructed only once and can thereafter be used for all GSKCG queries. At
each entry e of rtree, we compare MBR(P) with e's MBR and CBR and check the core
number of e in order to prune the users that cannot appear in the final result (Line 6,
Algorithm 11). Finally, we feed the subgraph of G that contains the users represented by
the entries in H to KCGFinder and return its output.
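The filtering step can be sketched as a tree traversal that applies the three tests of Line 6 to each child entry. This is an illustrative sketch, not the thesis implementation: the `Entry` class, its field names, and the choice to store a single CBR (for the queried k) per entry are assumptions, and the containment test is taken to be non-strict.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax); assumed layout


@dataclass
class Entry:
    mbr: Rect
    cn: int                                # maximum core number under this entry
    cbr: Optional[Rect] = None             # CBR for the queried k; None means empty
    user: Optional[str] = None             # set only on leaf entries
    children: List["Entry"] = field(default_factory=list)


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def contained_in(inner: Rect, outer: Optional[Rect]) -> bool:
    return outer is not None and (outer[0] <= inner[0] and outer[1] <= inner[1]
                                  and inner[2] <= outer[2] and inner[3] <= outer[3])


def candidate_users(root: Entry, points: List[Tuple[float, float]], k: int) -> List[str]:
    """Return the users that survive the MBR, core-number, and CBR tests."""
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    q = (min(xs), min(ys), max(xs), max(ys))       # MBR(P)
    users, stack = [], [root]
    while stack:
        e = stack.pop()
        if e.user is not None:                     # leaf entry: keep its user
            users.append(e.user)
            continue
        for c in e.children:
            if intersects(q, c.mbr) and c.cn >= k and not contained_in(q, c.cbr):
                stack.append(c)
    return sorted(users)
```

The surviving users would then be passed, as an induced subgraph, to the exact search (KCGFinder in the text).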
It is easy to extend our algorithm to support the case where each user has multiple
associated regions. For each associated region of a user u, we index u together with this
region once in the Enhanced SaR-tree. Thus, the number of associated regions of u
equals the number of times u is indexed. In spatial filtering, if a user u appears more
than once, we simply merge the occurrences. The subsequent branch and bound search
process remains the same as in the case where each user has exactly one associated region.
4.4 Performance Evaluation
In this section, we experimentally study the performance of three algorithms. The first one
is the basic KCGFinder (referred to as Baseline) presented in Section 4.2.1. The second
one is KCGFinder coupled with the set of pruning techniques (Advanced). The third one
is SaRBasedKCGFinder (SaRBased).
Table 4.2: Dataset properties
Property                             Brightkite    Gowalla
Total # of users                     58,228        196,591
Total # of friend relations          214,018       950,327
Medium region area (km^2)            12.24         10.23
Diameter (longest shortest path)     16            14
Total # of check-ins                 4,491,143     6,442,890
Maximum # of cores                   52            51
4.4.1 Datasets and Queries
We evaluate the proposed algorithms on two datasets collected from Brightkite and
Gowalla,⁴ two real-world LBSNs. The properties of the two datasets are summarized in
Table 4.2. Since these websites do not directly provide users' regions, we use a density-
based clustering method to form their regions from check-in locations. A user may have
several clustered associated regions. By default, we choose the region with the most
check-ins as a user’s associated region. The medium areas of the regions are 12.24 km2
on Brightkite and 10.23 km2 on Gowalla, respectively. More specifically, the cumulative
distribution of the region sizes on Brightkite is: 21.5% of regions ≤ 0.1 km2; 33.2%, ≤1
km2; 45.6%, ≤10 km2; 79.4%, ≤100 km2, and so on. For Gowalla, the distribution is:
19.1% of regions ≤ 0.1 km2; 28.2%, ≤1 km2; 50.7%, ≤10 km2; 80.4%, ≤100 km2, and
so on. In our experiments, we also show the results of the case where users have mul-
tiple associated regions. The testing GSKCG queries are randomly generated on these
two real-life datasets. The query set on Brightkite includes 100 queries, while the query
set on Gowalla includes 200 queries. Each query contains several query points that are
randomly selected from all users’ associated regions.
4.4.2 Setup
All the algorithms are implemented in Java. The experiments run on a machine with an
Intel Xeon X5650 2.67 GHz processor and 8 GB of DDR3 memory. The number of query points
and the value of k in a query both vary from 1 to 5. Unless explicitly specified, the default
value of k and the default number of query points in a query are both set to 3. The fanout
of the Enhanced SaR-tree is 100. The storage cost and building time of the Enhanced
SaR-tree on Brightkite and Gowalla are (96.3 MB, 10.23 mins) and (275.6 MB, 29.03 mins),
respectively.

⁴ Publicly available at: http://snap.stanford.edu/.

[Figure 4.9: Running time vs. k value. Panels: Brightkite, Gowalla; x-axis: k (1 to 5); y-axis: running time (ms); series: Baseline, Advanced, SaRBased.]
4.4.3 Experimental Results
We evaluate the performance of these three algorithms under different parameter settings.
As in many other performance evaluation schemes for query processing [41, 67, 68], we
report the overall performance in terms of the average query running time.
Effect of the value of k. In the first set of experiments, we evaluate the performance
of the algorithms under different k values. From Figure 4.9, we can observe that both
Advanced and SaRBased perform substantially better than Baseline. Note that the y-axis
is in log scale. As k increases, SaRBased exhibits increasingly better performance
than Baseline and Advanced. This is because the larger k is, the larger the CBRs are, and
therefore the query points are more likely to be covered by CBRs, causing more
tree branches to be pruned. When k = 1, the difference in query time among these three
algorithms is small. The reason is that k = 1 indicates a loose social constraint and small
CBRs, resulting in weak pruning capabilities.
Effect of the number of query points. In Figure 4.10, we examine the query perfor-
mance by varying the number of query points. In general, the query time increases when
the number of query points increases because more query points require more candidate
users to be added into the search space. Compared to the other two algorithms, SaRBased
is relatively less sensitive to the increase of the number of query points. Even when the
number is 5, the performance of SaRBased is still reasonably good.
[Figure 4.10: Running time vs. number of query points. Panels: Brightkite, Gowalla; x-axis: number of query points (1 to 5); y-axis: running time (ms); series: Baseline, Advanced, SaRBased.]

[Figure 4.11: Running time vs. query point coverage. Panels: Brightkite, Gowalla; x-axis: coverage (0.1 km^2, 1 km^2, 10 km^2, 100 km^2); y-axis: running time (ms); series: Baseline, Advanced, SaRBased.]

Effect of the coverage of query points. In this set of experiments, we evaluate the
performance by varying the coverage of query points. From Figure 4.11, we find that
SaRBased still performs best under different coverages. When the coverage is small,
the query points are more likely to fall entirely into some CBRs of the enhanced SaR-tree,
leading to a smaller search space. This explains why the running time is shorter when the
coverage is smaller.
Effect of multiple associated regions. As a proof-of-concept, in Figure 4.12, we present
the performance of the algorithms when each user has on average 3 associated regions.
Although it takes longer to process the queries, the general trends of Figure 4.12 are
similar to those of a single associated region presented in Figures 4.9 and 4.10. When a
user is associated with more associated regions, the running time increases because the
number of users covering the query points increases, which implies a larger search space.
Due to space limitations, we only show the performance on Brightkite. We observe similar
results on Gowalla.
Pruning capabilities of different strategies. In Figure 4.13, we show the pruning capa-
bilities of different pruning strategies, where BP, DBP, AO and SAR stand for basic prun-
ing, diameter based pruning, access order based pruning and enhanced SaR-tree based
pruning, respectively. We can see that all strategies can help reduce the running time.
[Figure 4.12: Running time under multiple familiar regions. Panels: varying k (Brightkite), varying number of query points (Brightkite); y-axis: running time (ms); series: Baseline, Advanced, SaRBased.]

[Figure 4.13: Pruning capabilities of different schemes. Panels: varying k (Brightkite), varying number of query points (Brightkite); y-axis: running time (ms); series: BP, BP+DBP, BP+DBP+AO, BP+DBP+AO+SAR.]
In particular, AO is most effective when the value of k or the number of query points is
relatively large.
Sizes of returned user groups. Figure 4.14 shows the average group size of the query
results. We can see that when k or the number of query points is increasing, the group
size is also increasing. The reason is that when k is increasing, the lower bound of the
group size to form a k-core group is increasing. Meanwhile, if the number of the query
points is increasing, the group may need more users to cover the query points.
Effect of the size of LBSNs. Next, we show the scalability of the algorithms under
various network sizes in Figure 4.15. We randomly extract several subsets of users with
increasing sizes to test the algorithms' scalability. As expected, the result demonstrates
that SaRBased achieves the best efficiency in all cases. Even over relatively large networks
(e.g., more than 180k users), it still responds quickly, demonstrating the potential for
practical use.
[Figure 4.14: Size of query results. Panels: varying k, varying number of query points; y-axis: group size; series: Brightkite, Gowalla.]

[Figure 4.15: Running time vs. network size. Panels: Brightkite (10k to 50k users), Gowalla (30k to 180k users); y-axis: running time (ms); series: Baseline, Advanced, SaRBased.]

Quality of query results. Finally, we compare the quality of the results returned by GSKCG
queries and by existing spatial task outsourcing in terms of the average group size and the
average social cohesiveness (i.e., the average number of familiar persons that each member
has in the group) of the returned group. Specifically, we consider the typical spatial task
outsourcing (STO) problem, which finds a minimum group to collaboratively cover a given
number of spatial-point-related tasks. We set the parameters of the GSKCG query Q = (k, P)
to be k = 2 and |P| = 5, and the parameters of the STO query Q = (P) (i.e., not considering
the social constraint) to be |P| = 5. From the experimental results shown in Figure 4.16,
we make the following observation: the average size of the group returned
by GSKCG queries is close to the size of the group returned by STO queries, whereas
the average social cohesiveness of the group returned by GSKCG queries is much larger
than that of the group returned by STO queries. Thus, we conclude that, compared with
STO queries, the GSKCG query can find groups with much better social cohesiveness at
the cost of a small increase in the group size. This is meaningful for real-life
applications of collaborative spatial computing.
[Figure 4.16: Quality comparison of the returned groups. Panels: Brightkite, Gowalla; x-axis: query type (GSKCG, STO); y-axis: quality of result; series: group size, social cohesiveness.]
4.5 Summary
In this chapter, we have introduced a new, practical type of query, the GSKCG query, which
considers both users' associated spatial regions and their social acquaintance levels. A
GSKCG query aims to find a minimum user group that covers all query points and that is a
k-core. We have proposed an efficient algorithm, SaRBasedKCGFinder, to find the optimal
solution, whose success lies in a set of effective pruning strategies and a novel index
structure. Extensive experiments on two real-life datasets demonstrate the efficiency and
effectiveness of our solution.
Chapter 5
Towards Social-aware Ridesharing
Group Query Services
As described in Chapter 1, ridesharing is a promising approach for reducing energy
consumption and easing traffic congestion while satisfying people's commuting needs.
However, the main problem in current ridesharing systems is the trust issue, which
keeps the acceptance level of ridesharing low. To tackle this problem, in this chapter we
propose a new kind of ridesharing query, namely the social-aware ridesharing group (SaRG)
query, which is based on trip matching and social acquaintance. The rest of this chapter
is organized as follows. The SaRG query problem is formulated in Section 5.1. We
propose the baseline algorithm and a set of pruning strategies for SaRG query processing
in Section 5.2. In Section 5.3, we present several incremental approaches to avoid
a large number of repeated computations. The SIR-tree index structure is presented in
Section 5.4. Experimental results are reported in Section 5.5, followed by the summary
of this chapter in Section 5.6.
5.1 Problem Formulation
In this section, we present some preliminaries and provide the problem statement, fol-
lowed by an example to illustrate the problem defined. Table 5.1 summarizes the notations
used throughout this chapter.
Table 5.1: Summary of notations
Notation                     Definition
$G = (V,E)$                  a social network.
$D$                          the rider space in which each rider has a ride request.
$G[V']$                      the subgraph of G containing only V'.
$u, u_i, v, v_i$             a user of G; u or $u_i$ represents a driver, v or $v_i$ represents a rider.
$tp_v = (o, d)$              the rider v's ridesharing trip, where o and d represent the origin and destination of v, respectively.
$tp_u$                       a driver u's ridesharing trip.
$q_u$                        an SaRG query of the driver u.
$D(tp_v, tp_u)$              the travel cost of a rider v.
$G_s^k(u)$                   a ridesharing group containing a driver u and a set of s riders.
$D(tp_u, G_s^k(u))$          the travel cost of a ridesharing group $G_s^k(u)$.
$SI$                         an intermediate solution set where $|SI| \le s$.
$SU$                         the set of remaining riders.
$D_{lb}(tp_u, SI)$           the travel cost lower bound of any valid ridesharing group derived from SI.
$L_v^m$                      a size-m rider list in which the seen riders are sorted in ascending order by their travel costs.
$CS$                         the sorted list of seen riders.
$D_{lb}(v, CS)$              the lower bound of the travel cost on the unseen rider v in $D - CS$.
$D_{lb}(G_s^k(u), L_v^m)$    the lower bound of the travel cost on the unseen ridesharing groups $G_s^k(u)$ in $L_v^m$.
$NB_{SI}(v)$                 the set of v's neighbors in SI.
$Dia(G_s^k(u))$              the diameter of $G_s^k(u)$.
$Dia_{ub}(G_s^k(u))$         the upper bound of $Dia(G_s^k(u))$.
$A(v)$                       the access index of the last user of user v's neighbor in $L_v^m$.
$k_{SI}(v)$                  the core number of user v in the subgraph $G[SI]$.
$c_{max}(e)$                 the maximum core number of the users rooted at the entry e of the SIR-tree.
$nb[e|x]$                    the user set containing the users whose social distances to all users rooted at e are $\le x$.
As motivated by the social-aware ridesharing framework in Chapter 1, we define an
SaRG query over a set of riders D and a social network G=(V,E). Each rider v∈D
has a ridesharing trip request denoted by $tp_v$=(o, d), where o and d represent the origin
and destination of v's trip, respectively. In the social network G, each vertex v∈V
is a user (either a driver or a rider) and each edge e∈E denotes an acquaintance relation
between the two users it connects. Each driver u's ride offer forms an SaRG query $q_u$, which
is introduced later. Once the RSP receives an SaRG query $q_u$ from a driver u, it
returns to u the most suitable riders from D by considering trip matching and social
acquaintance. Before giving the formal definition of an SaRG query, we explain how to
measure trip matching between the riders and the driver, and social acquaintance among the
members of a ridesharing group.
An SaRG query aims to find a ridesharing group with a desired level of social acquaintance.
To model such social acquaintance, we assume the existence of a social network
graph in which users are connected if they have acquaintance relationships (e.g., friends
or colleagues). Such a network might be derived from call graphs based on telephone call
detail records (CDRs) or from online social networks such as Facebook and Twitter [16]. There
are a number of social models that can be employed to measure the social cohesiveness
of a ridesharing group, such as star (friend) (one central user has direct connections to all
other users), star (friend of friend) (one central user has direct or through-a-friend
connections to all other users), and k-core (see Definition 5.1; each user has direct
connections with at least k other users).

Table 5.2: Survey results (216 participants)
Social Model               Acceptable for Ridesharing
Star (friend)              95.43%
Star (friend of friend)    71.23%
1-core                     92.24%
Definition 5.1. (k-core) Given a graph G=(V,E), a k-core is a connected subgraph
SG=(SV, SE) (SV⊆V, SE⊆E) in which each vertex v∈SV has degree at least k.
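Definition 5.1 can be checked directly on an induced subgraph: every member must have at least k neighbors inside the group, and the group must be connected. The sketch below assumes the social network is given as a dictionary mapping each user to a set of neighbors; the function name `is_k_core` is hypothetical.

```python
from typing import Dict, Set


def is_k_core(adj: Dict[str, Set[str]], members: Set[str], k: int) -> bool:
    """True if the subgraph induced by `members` is connected with min degree >= k."""
    if not members:
        return False
    # Degree condition, counted inside the induced subgraph only.
    if any(len(adj[v] & members) < k for v in members):
        return False
    # Connectivity via a depth-first traversal restricted to `members`.
    start = next(iter(members))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v] & members:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == members
```

Note that the degree test alone is not sufficient: two disjoint triangles satisfy it for k = 2 but do not form a single k-core under this definition.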
To compare these social models, we have conducted an online survey with 216 volunteers
to evaluate their acceptance levels for ridesharing (see Table 5.2 for the survey
result).¹ In addition to users' acceptance level, the feasibility of forming ridesharing
groups in real-life applications (i.e., whether the service provider can find ridesharing
groups for drivers that satisfy the social model being used) is an equally important factor
in selecting an appropriate social model. For this reason, we have also examined the
potential groups under different social models for the users of New York City in two real
datasets (Brightkite and Gowalla [17]). Figure 5.1 gives the numbers of potential size-5
ridesharing groups. It can be observed that while the star (friend) model achieves a good
acceptance level, it is too demanding to yield a sufficient number of social groups. Combining
these two aspects, in this chapter we take k-core as the primary social model to address
the social-aware ridesharing problem.
We next explain how to measure trip matching of a ridesharing group. The primary
cost of a rider in Slugging is the travel cost from the rider's origin to the driver's origin
and from the rider's destination to the driver's destination. Therefore, we define the travel
costs of a rider and a ridesharing group as follows.
¹ http://www.sojump.com/.

[Figure 5.1: Numbers of potential social groups of size 5. Bars for the two datasets under three social models: Star (Friend), Star (Friend of Friend), 1-Core; axis range 0 to 80000.]
Definition 5.2. (Travel cost of a rider) Given the trip $tp_u$ of a driver u's ride offer, the
travel cost of a rider v is defined as:

\[
D(tp_v, tp_u) = \|tp_v.o, tp_u.o\| + \|tp_v.d, tp_u.d\|, \tag{5.1.1}
\]

where $\|\cdot, \cdot\|$ denotes the Euclidean distance between two spatial points.
A ridesharing group consists of a driver u and a size-s set of riders, denoted by $G_s^k(u)$,
where s is the number of available seats. Note that the size of a ridesharing group is s+1.
Definition 5.3. (Travel cost of a ridesharing group) Given a driver u's trip $tp_u$, the
travel cost of a ridesharing group $G_s^k(u)$ is

\[
D(tp_u, G_s^k(u)) = \sum_{v \in G_s^k(u)} D(tp_v, tp_u). \tag{5.1.2}
\]

We call a ridesharing group $G_s^k(u)$ a k-core group if the subgraph $G[G_s^k(u)]$ of the
underlying social network G is a k-core. Now we are ready to define an SaRG query.
Definition 5.4. (SaRG query) Given a set of riders D and a social network G=(V,E),
an SaRG query is defined as a quadruple $q_u$=(u,k,s,$tp_u$), where $q_u.u$ is the driver (query
issuer), $q_u.k$ and $q_u.s$ are positive integers indicating the social acquaintance constraint
in terms of k-core and the number of available seats for ridesharing, respectively, and
$q_u.tp_u$ is the driver u's trip. The query returns the ridesharing group $G_s^k(u)$ with the
minimum travel cost among all k-core ridesharing groups of size s+1 in G. A ridesharing group
is valid with respect to an SaRG query $q_u$ if it is a k-core group and its size is s+1.
The returned ridesharing group should have the smallest travel cost because, naturally,
only riders whose origin and destination are close to those of the driver are willing to join
the driver's ridesharing. Below we give an example to illustrate an answer to an SaRG
query.
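[Figure 5.2: An example of an SaRG query. The figure shows a social level (users u, v1, v2, v3) and a spatial level (origins o, o1, o2, o3 and destinations d, d1, d2, d3). The table on the right lists the two distance terms of each rider's travel cost: v1: 1 and 1; v2: 2 and 1.5; v3: 2.5 and 3.]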
Example 5.1. Consider a social network G=(V,E), a set of riders D={v1,v2,v3}, and a
driver u, as shown in Figure 5.2. The travel cost of each rider $v_i$ (1≤i≤3) is listed in the
right table of Figure 5.2. The SaRG query $q_u$=(u,k,s,$tp_u$) with k=2, s=2, and $tp_u$=(o,d)
returns $G_s^k(u)$={u,v1,v3} because {u,v1,v3} is the group with the minimum travel cost
$D(tp_u, \{u, v_1, v_3\})$=(1+1)+(2.5+3)=7.5 among all size-3 2-core ridesharing groups (the
other group is {u,v2,v3}).
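Example 5.1 can be reproduced by brute force: enumerate every size-s rider subset, keep those that form a k-core with the driver, and take the cheapest. In the sketch below the travel costs come from the example, while the edge list is an assumption chosen so that {u,v1,v3} and {u,v2,v3} are the only size-3 2-core groups, as the example states; the actual edges of Figure 5.2 are not recoverable from the text. The function name `sarg_brute_force` is hypothetical.

```python
from itertools import combinations

# Travel costs D(tp_v, tp_u) taken from Example 5.1.
travel_cost = {"v1": 1 + 1, "v2": 2 + 1.5, "v3": 2.5 + 3}

# ASSUMED social edges (not given in the text): two triangles sharing u and v3.
edges = [("u", "v1"), ("u", "v2"), ("u", "v3"), ("v1", "v3"), ("v2", "v3")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)


def sarg_brute_force(driver, k, s):
    """Return (group, cost) of the cheapest group whose members all have degree >= k."""
    best_group, best_cost = None, float("inf")
    for riders in combinations(sorted(travel_cost), s):
        group = {driver, *riders}
        # For the size-3, k=2 groups here the degree test forces a triangle,
        # so connectivity holds automatically; larger cases need an explicit check.
        if all(len(adj[v] & group) >= k for v in group):
            cost = sum(travel_cost[v] for v in riders)
            if cost < best_cost:
                best_group, best_cost = group, cost
    return best_group, best_cost


group, cost = sarg_brute_force("u", k=2, s=2)
```

The driver contributes zero to the sum, since $D(tp_u, tp_u) = 0$ under Definition 5.2. Exhaustive enumeration is exponential in s, which motivates the pruning-based search developed in the rest of the chapter.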
We establish the hardness of the SaRG query problem in the theorem below.
Theorem 5.1. The SaRG query problem is NP-hard.
Proof. We prove the hardness by a reduction from a classical NP-complete problem,
namely the p-clique problem [33]. An instance of the p-clique problem consists of a graph
G′=(V′,E′), where V′ and E′ are the vertex set and edge set of G′, respectively. The
decision problem is to determine whether there exists a clique (i.e., a complete subgraph)
of size p in G′.
Given an instance of p-clique, we construct an instance of the SaRG query $q_u$=(u,k,s,$tp_u$)
on a set of users with G=G′, s=p-1, and k=s, and set the travel cost between any two users
in G to 1. If G′ contains a p-clique, there must exist a group of size p in G such that each
member of this group has social connections with the other s members of the group, and
the group has the minimum travel cost of s. This gives the necessary condition. Conversely,
if G in the SaRG problem contains a valid group of size p with k=s, then every member is
connected to all other s members, so G′ in the p-clique problem must also contain a clique
of size p. This gives the sufficient condition. Hence, the theorem is proved.
In this chapter, we tackle the problem of efficiently processing SaRG queries in prac-
tical settings.
5.2 Algorithm Design
In this section, we present an algorithm named RSGExplorer and a set of pruning strate-
gies to obtain the optimal answer to an SaRG query.
5.2.1 RSGExplorer Algorithm
The general idea of RSGExplorer is that, given an SaRG query $q_u$=(u,k,s,$tp_u$), we first
retrieve the top-m (m≥s) riders in D with the minimum travel cost, and then invoke the
branch and bound search to find the current optimal answer $G_s^k(u)$ among these top-m riders.
If the travel cost of $G_s^k(u)$ is less than the travel cost lower bound of the unseen ridesharing
groups, $G_s^k(u)$ is returned as the final optimal answer. Otherwise, we continue to retrieve
the top-(m+1) rider and reinvoke the branch and bound search to find the next optimal
answer. The above process repeats until the final optimal answer is identified.
To retrieve the top-m riders with the minimum travel cost, we build two spatial RTree
indexes [28], $rtree_o$ and $rtree_d$, to index the origins and destinations of the riders in D,
respectively. By adopting typical kNN search in spatial databases [52] over the RTree index,
we can easily visit the riders in increasing order of the distance between their origins
(destinations) and the driver's origin (destination). Algorithm 12 shows the pseudo code
of RSGExplorer. In the beginning, we initialize a set of variables (Lines 1–7). The priority
queues $Q_o$ and $Q_d$ are initialized as ($rtree_o.root$, 0) and ($rtree_d.root$, 0), respectively.
The elements of $Q_o$ ($Q_d$) are sorted in increasing order by the shortest distances between
their corresponding RTree entries and the driver's origin (destination). m and $D_{lb}(v, CS)$
are respectively initialized to s and ∞, which are used for finding the top-m riders with
the minimum travel cost. The optimal ridesharing group to return, $G_s^k(u)$, is initialized to
∅. Note that the initial travel cost of an empty $G_s^k(u)$, cost, is set to ∞. Two sorted rider
lists CS and $L_v^m$, in which the riders are sorted in ascending order by their travel costs, are
both initialized as ∅. In addition, two rider sets SI and SU are also set to ∅ for the branch
and bound search over $L_v^m$ in a later stage.

Algorithm 12 RSGExplorer(Driver u, Integer k, Integer s, Trip tp_u, SocialNetwork G, RTree rtree_o, RTree rtree_d)
1: Initialize priority queues Q_o and Q_d with entries (rtree_o.root, 0) and (rtree_d.root, 0), respectively;
2: Integer m ← s;
3: Initialize the sorted rider lists CS and L_v^m as ∅;
4: Ridesharing group G_s^k(u) ← ∅;
5: Double cost ← ∞;
6: Double D_lb(v, CS) ← ∞;
7: Initialize the rider sets SI and SU as ∅;
8: while |CS| ≤ |D| do
9:     v′ ← GetNextRider(Q_o, rtree_o); // tp_v′.o is closest to tp_u.o
10:    v′′ ← GetNextRider(Q_d, rtree_d); // tp_v′′.d is closest to tp_u.d
11:    Insert rider v′ (v′′) into CS unless v′ ∈ CS (v′′ ∈ CS);
12:    if |CS| ≥ m then
13:        D_lb(v, CS) ← ||tp_v′.o, tp_u.o|| + ||tp_v′′.d, tp_u.d||;
14:        Rider v_m ← the m-th rider in the rider list CS;
15:        if D(tp_vm, tp_u) ≤ D_lb(v, CS) then
16:            Insert the first m riders of CS into L_v^m;
17:            SI ← v_m;
18:            SU ← L_v^m − v_m;
19:            G_s^k(u)′ ← GetOptimalGroup(u, k, s, tp_u, SI, SU, G_s^k(u), G);
20:            if D(tp_u, G_s^k(u)′) ≤ cost then
21:                G_s^k(u) ← G_s^k(u)′;
22:                cost ← D(tp_u, G_s^k(u));
23:            if cost ≤ D_lb(G_s^k(u), L_v^m) then
24:                Return G_s^k(u);
25:            m = m+1;
26: Return ∅;
After the initialization stage, we use the function GetNextRider(·) (the typical kNN
search mentioned above) to find the next rider v′ (v′′) whose origin (destination) is closest
to the driver's origin (destination). We compute the travel cost of v′ (v′′) and insert v′ (v′′)
into CS if v′ ∉ CS (v′′ ∉ CS) (Lines 9–11). Once the size of CS becomes ≥ m, we calculate
the travel cost lower bound $D_{lb}(v, CS)$ of the unseen riders according to Theorem 5.2.
Theorem 5.2. Let $v'$ and $v''$ be the riders newly found by GetNextRider(·) in $rtree_o$ and
$rtree_d$, respectively, and let CS be the sorted list of seen riders. The travel cost lower bound
of any unseen rider $v \notin CS$ is $D_{lb}(v, CS) = \|tp_{v'}.o, tp_u.o\| + \|tp_{v''}.d, tp_u.d\|$.
Proof. Since $v'$ and $v''$ are the riders newly found by GetNextRider(·) in $rtree_o$ and
$rtree_d$, respectively, for any unseen rider v we have $\|tp_v.o, tp_u.o\| \ge \|tp_{v'}.o, tp_u.o\|$ and
$\|tp_v.d, tp_u.d\| \ge \|tp_{v''}.d, tp_u.d\|$. Thus, the travel cost of v satisfies

\[
\begin{aligned}
D(tp_v, tp_u) &= \|tp_v.o, tp_u.o\| + \|tp_v.d, tp_u.d\| \\
&\ge \|tp_{v'}.o, tp_u.o\| + \|tp_{v''}.d, tp_u.d\| \\
&= D_{lb}(v, CS)
\end{aligned}
\]

Thus, this theorem is proved.
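The way Theorem 5.2 is used to confirm the top-m riders can be sketched as follows. This is a simplified illustration, not the thesis implementation: two pre-sorted lists stand in for the incremental kNN searches over $rtree_o$ and $rtree_d$, trips are assumed to be pairs of 2-D points, and the function name `top_m_riders` is hypothetical.

```python
import math


def top_m_riders(driver, riders, m):
    """riders: name -> (origin, destination); driver: (origin, destination).

    Returns (top-m rider names by travel cost, lower bound at termination).
    """
    d_o = {v: math.dist(t[0], driver[0]) for v, t in riders.items()}
    d_d = {v: math.dist(t[1], driver[1]) for v, t in riders.items()}
    by_origin = sorted(riders, key=d_o.get)   # stand-in for kNN over rtree_o
    by_dest = sorted(riders, key=d_d.get)     # stand-in for kNN over rtree_d
    seen = {}                                 # CS: seen rider -> travel cost
    for v1, v2 in zip(by_origin, by_dest):
        for v in (v1, v2):
            seen[v] = d_o[v] + d_d[v]
        lower_bound = d_o[v1] + d_d[v2]       # D_lb(v, CS) of Theorem 5.2
        ranked = sorted(seen, key=seen.get)
        # Once the m-th seen rider costs no more than the bound, no unseen
        # rider can enter the top-m.
        if len(ranked) >= m and seen[ranked[m - 1]] <= lower_bound:
            return ranked[:m], lower_bound
    return sorted(seen, key=seen.get)[:m], math.inf
```

The early exit is what makes the sorted access worthwhile: riders far from both the driver's origin and destination are never touched.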
If the travel cost of the m-th rider $v_m$ in CS is ≤ $D_{lb}(v, CS)$, the top-m riders in D
with the minimum travel cost are found (Lines 13–15). Afterwards, we insert the first m
riders in CS into $L_v^m$ (Line 16). We add $v_m$ into SI and $L_v^m - v_m$ into SU in order to make
sure that $v_m$ is in the newly found group, which avoids re-enumerating the
ridesharing groups that appeared in the previous iterations (Lines 17–18). The intuition is
that all the ridesharing groups formed by u and s riders from $L_v^m - v_m$ (i.e., $L_v^{m-1}$) have
been checked in the previous branch and bound search over the search space $L_v^{m-1}$. Thus we
only need to check the remaining ridesharing groups, which consist of u, $v_m$, and s-1 other
riders from $L_v^m - v_m$.
We then invoke Algorithm 13 to find the current optimal answer $G_s^k(u)'$ in $L_v^m$ (Line
19). If the travel cost of $G_s^k(u)'$ is less than or equal to the travel cost lower bound of the
unseen ridesharing groups, $G_s^k(u)'$ is returned as the final optimal answer (Lines 20–24).
The correctness is guaranteed by Theorem 5.3.
Theorem 5.3. Let $v_i$ be the i-th rider in the current sorted rider list $L_v^m$. A lower bound
of the travel cost of any unseen ridesharing group $G_s^k(u)$ in $L_v^m$ is

\[
D_{lb}(G_s^k(u), L_v^m) = D(tp_{v_m}, tp_u) + \sum_{j=1}^{s-1} D(tp_{v_j}, tp_u). \tag{5.2.3}
\]

Proof. For any unseen ridesharing group $G_s^k(u)$, there must exist a rider $v_t \in G_s^k(u)$ such
that $D(tp_{v_t}, tp_u) > D(tp_{v_m}, tp_u)$. As $v_1, v_2, \ldots, v_{s-1}$ are the top-(s-1) riders in $L_v^m$, we have
$\sum_{v \in G_s^k(u) - v_t} D(tp_v, tp_u) > \sum_{j=1}^{s-1} D(tp_{v_j}, tp_u)$. Therefore, we have

\[
\begin{aligned}
D(tp_u, G_s^k(u)) &= \sum_{v \in G_s^k(u)} D(tp_v, tp_u) \\
&= D(tp_{v_t}, tp_u) + \sum_{v \in G_s^k(u) - v_t} D(tp_v, tp_u) \\
&> D(tp_{v_m}, tp_u) + \sum_{j=1}^{s-1} D(tp_{v_j}, tp_u) \\
&= D_{lb}(G_s^k(u), L_v^m)
\end{aligned}
\]

This proves the theorem.
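Equation (5.2.3) translates into a one-line computation over the sorted list of travel costs. The sketch below assumes the costs of $L_v^m$ are supplied in ascending order; the function name `unseen_group_lower_bound` is hypothetical.

```python
from typing import Sequence


def unseen_group_lower_bound(sorted_costs: Sequence[float], s: int) -> float:
    """Equation (5.2.3): cost of the m-th rider plus the s-1 cheapest riders.

    sorted_costs holds the travel costs of L_v^m in ascending order, so its
    last element is D(tp_{v_m}, tp_u).
    """
    return sorted_costs[-1] + sum(sorted_costs[: s - 1])
```

RSGExplorer can stop as soon as the cost of the best group found so far does not exceed this value.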
Algorithm 13 shows the pseudo code of GetOptimalGroup, which attempts to find the
most suitable ridesharing group from $L_v^m$. For a systematic enumeration of all candidate
ridesharing groups, we employ the branch and bound search algorithm. In the branch
and bound search process, we keep track of two rider sets SI and SU, which represent the
intermediate solution set and the set of remaining riders, respectively. This process can
be organized into a tree structure, as illustrated in Figure 5.3, in which an internal node
represents an SI and a leaf node represents a size-s rider set. Given an internal node SI, we
can derive a travel cost lower bound $D_{lb}(tp_u, SI)$ of any valid ridesharing group derived
from SI and SU, which is guaranteed by Theorem 5.4.
Theorem 5.4. Let v′ be the rider with the maximum travel cost in $S_I$. The travel cost
lower bound of any valid ridesharing group derived from $S_I$ and $S_U$ is

$$D_{lb}(tp_u, S_I) = (s - |S_I|) \cdot D(tp_{v'}, tp_u) + \sum_{v \in S_I} D(tp_v, tp_u). \tag{5.2.4}$$
Proof. Suppose $G_s^k(u)$ is an intermediate answer derived from $S_I$ and $S_U$. For any rider
$v \in G_s^k(u) - S_I$, we have $D(tp_v, tp_u) \ge D(tp_{v'}, tp_u)$. Thus,
$\sum_{v \in G_s^k(u) - S_I} D(tp_v, tp_u) \ge \sum_{v \in G_s^k(u) - S_I} D(tp_{v'}, tp_u) = (s - |S_I|) \cdot D(tp_{v'}, tp_u)$. We have:

$$
\begin{aligned}
D(tp_u, G_s^k(u)) &= \sum_{v \in G_s^k(u)} D(tp_v, tp_u) \\
&= \sum_{v \in G_s^k(u) - S_I} D(tp_v, tp_u) + \sum_{v \in S_I} D(tp_v, tp_u) \\
&\ge (s - |S_I|) \cdot D(tp_{v'}, tp_u) + \sum_{v \in S_I} D(tp_v, tp_u) \\
&= D_{lb}(tp_u, S_I)
\end{aligned}
$$

Thus, this theorem is proved.

[Figure 5.3: Branch and bound search tree. Node labels: NULL, v1, v1v2, v1v3, v1v2v3, v1v2v4, v1v3v4; edge labels: Expanding, Backtracking.]
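As a small illustration of Eq. (5.2.4), the sketch below evaluates the bound from the costs of the riders already in SI. The function name and the plain list input are assumptions for illustration; returning 0 for an empty SI is also an assumption, since v′ is undefined in that case.

```python
def intermediate_lower_bound(si_costs, s):
    """Eq. (5.2.4): every rider still to be added costs at least max(si_costs).

    si_costs: costs D(tp_v, tp_u) of the riders currently in S_I
              (hypothetical input format).
    s:        number of riders in a complete group.
    """
    if not si_costs:
        return 0.0  # assumption: trivial bound when S_I is empty
    return (s - len(si_costs)) * max(si_costs) + sum(si_costs)


# With S_I holding riders of cost 2 and 3.5 and s = 3, one slot is left,
# and it costs at least 3.5.
print(intermediate_lower_bound([2, 3.5], 3))  # 9.0
```

The bound relies on riders being added in ascending cost order, which is what makes the most expensive member of SI a floor for every remaining candidate.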
Based on this travel cost lower bound, if the travel cost of the current optimal answer
$G_s^k(u)$ is no greater than $D_{lb}(tp_u, S_I)$, no ridesharing group derived from SI and SU can
improve on it, so this part of the search space is pruned (Line 2). Otherwise, we
iteratively add the rider with the minimum travel cost from SU to SI and check whether
the resultant group is valid (Lines 4–8). A property of k-core is that if a vertex is not
in the maximum k-core of G, it cannot be in any k-core subgroup of G. Thus, if the
maximum k-core computed from $G[u \cup S_I' \cup S_U]$ cannot cover all riders in $u \cup S_I'$,
no valid ridesharing group can be derived from $S_I'$ and SU (Lines 10–12). Otherwise, we
recursively call GetOptimalGroup with the current values of the input arguments to find
the optimal answer $G_s^k(u)'$. If $D(tp_u, G_s^k(u)') < D(tp_u, G_s^k(u))$, we update the current
optimal answer $G_s^k(u)$ with $G_s^k(u)'$ (Lines 14–17).
Algorithm 13 GetOptimalGroup(Driver u, Integer k, Integer s, Trip tp_u, RiderSet S_I, RiderSet S_U, SaRG G_s^k(u), SocialNetwork G)
 1: while |S_I| + |S_U| ≥ s do
 2:     if D(tp_u, G_s^k(u)) ≤ D_lb(tp_u, S_I) then
 3:         Break;
 4:     Select the rider v with the minimum travel cost from S_U;
 5:     S_I′ ← S_I ∪ {v}, S_U ← S_U − {v};
 6:     if |u ∪ S_I′| = s+1 then
 7:         if G[u ∪ S_I′] is a k-core then
 8:             Return u ∪ S_I′;
 9:     else
10:         Compute the maximum k-core group S from G[u ∪ S_I′ ∪ S_U];
11:         if u ∪ S_I′ ⊈ S then
12:             Break;
13:         else
14:             S_U ← S − S_I′ − u;
15:             G_s^k(u)′ ← GetOptimalGroup(u, k, s, tp_u, S_I′, S_U, G_s^k(u), G);
16:             if D(tp_u, G_s^k(u)′) < D(tp_u, G_s^k(u)) then
17:                 G_s^k(u) ← G_s^k(u)′;
18: Return G_s^k(u);

We establish the correctness of RSGExplorer below.
Theorem 5.5. RSGExplorer finds the correct answer to an SaRG query.
Proof. We prove it by contradiction. Assume that RSGExplorer returns $G_s^k(u)$ as the final
optimal answer to $q_u = (u, k, s, tp_u)$. Now suppose there exists $G_s^k(u)'$ with the minimum
travel cost such that $D(tp_u, G_s^k(u)') < D(tp_u, G_s^k(u))$, where $G_s^k(u)$ and $G_s^k(u)'$ are found
from $L_v^m$ and $L_v^{m'}$ respectively. There are three possible cases: (1) If $m < m'$, which means
that $L_v^m \subset L_v^{m'}$, RSGExplorer first finds $G_s^k(u)$, and then $G_s^k(u)'$. By Theorem 5.3, we have
$D(tp_u, G_s^k(u)) \le D_{lb}(G_s^k(u), L_v^m) \le D(tp_u, G_s^k(u)')$, which contradicts the assumption
that $D(tp_u, G_s^k(u)') < D(tp_u, G_s^k(u))$. (2) If $m = m'$, which means that $L_v^m = L_v^{m'}$, we have
$D(tp_u, G_s^k(u)') = D(tp_u, G_s^k(u))$. Thus, RSGExplorer must return $G_s^k(u)$ or $G_s^k(u)'$ as the
final answer. (3) If $m > m'$, which means that $L_v^{m'} \subset L_v^m$, RSGExplorer first finds $G_s^k(u)'$,
and then $G_s^k(u)$. Again by Theorem 5.3, we have $D(tp_u, G_s^k(u)') \le D_{lb}(G_s^k(u), L_v^{m'}) \le
D(tp_u, G_s^k(u))$. Thus, RSGExplorer must return $G_s^k(u)'$ in $L_v^{m'}$ as the final answer, not
$G_s^k(u)$ in $L_v^m$. Hence, the correctness of RSGExplorer is proved.
As proved in Theorem 5.5, RSGExplorer correctly finds the optimal answer. However,
the enumeration process is time-consuming. Thus, we develop several pruning strategies
that shrink the search space in order to accelerate the search.
[Figure 5.4: An example of SaRG query. (a) A social network over the users u, v1, ..., v7. (b) The travel cost of rider vi, listed below.]

Rider        v1   v2   v3   v4   v5   v6   v7
Travel cost  2    3.5  4    4    4.5  4.5  7
5.2.2 Quota Available Strategy
By the definition of k-core, the degree of any vertex in the subgraph $G[G_s^k(u)]$
should be at least k. For the rider sets SI and SU, at most $s - |S_I|$ more riders can be
added, which is the quota left in any valid ridesharing group containing $u \cup S_I$. If the
minimum vertex degree of $G[u \cup S_I]$ plus this quota is still less than k, adding
riders from SU into SI cannot form a valid ridesharing group. This intuition is formalized in
Theorem 5.6.
Theorem 5.6. Let $NB_{u \cup S_I}(v)$ be the set of neighbors of v in $u \cup S_I$. If
$\min\{|NB_{u \cup S_I}(v)| \mid v \in u \cup S_I\} + s - |S_I| < k$, no valid ridesharing group can be
derived from the current SI and SU.
Proof. Let v′ be the user with the minimum number of neighbors in $u \cup S_I$. Since we can
add only $s - |S_I|$ users from SU to SI, the degree of v′ in any valid group with size s+1
is at most $|NB_{u \cup S_I}(v')| + (s + 1) - |u \cup S_I| = |NB_{u \cup S_I}(v')| + s - |S_I|$ (attained when all
added users are neighbors of v′). By Definition 5.1, to form a valid group, the degree of v′ in the
group should be ≥ k. This establishes the theorem.
Example 5.2. Consider an example of an SaRG query $q_u = (u, k, s, tp_u)$ with k=3, s=3,
in Figure 5.4. The social relations of the users are shown in Figure 5.4(a). Suppose
the current SI = {v1, v2} and SU = {v3, v4, v5, v6, v7}. We can calculate that u has the
minimum number of neighbors among u ∪ SI = {u, v1, v2}, with $|NB_{u \cup S_I}(u)| = 0$.
Since $|NB_{u \cup S_I}(u)| + s - |S_I| = 0 + 3 - 2 = 1 < k$, according to Theorem 5.6, we
conclude that no valid ridesharing group can be derived from the current SI and SU.
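The quota test of Theorem 5.6 needs only degrees inside the current partial group. The sketch below replays Example 5.2; the function name is hypothetical, and the edge set is an assumption chosen to be consistent with the examples in this chapter, since the drawing of Figure 5.4(a) is not available in this text.

```python
def build_adj(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def quota_prunes(adj, u, si, k, s):
    """True if Theorem 5.6 rules out every group extending u and S_I."""
    members = set(si) | {u}
    min_degree = min(len(adj.get(v, set()) & members) for v in members)
    quota_left = s - len(si)  # riders that can still be added
    return min_degree + quota_left < k


# Assumed (hypothetical) edge set for the users u, v1, ..., v7.
EDGES = [("u", "v4"), ("u", "v6"), ("v4", "v6"), ("v4", "v5"), ("v5", "v6"),
         ("v3", "v5"), ("v1", "v3"), ("v1", "v2"), ("v4", "v7")]
ADJ = build_adj(EDGES)

# Example 5.2: k = 3, s = 3, S_I = {v1, v2}; u has no neighbor in u ∪ S_I.
print(quota_prunes(ADJ, "u", ["v1", "v2"], k=3, s=3))  # True
```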
Table 5.3: Access indexes of users in Figure 5.4

A(u)  A(v1)  A(v2)  A(v3)  A(v4)  A(v5)  A(v6)  A(v7)
6     3      1      5      7      6      5      4
In Theorem 5.6, we only consider the quota constraint between the group size s+1
and the social constraint k. In some cases, even if SI satisfies Theorem 5.6, riders from
SU still cannot be added into SI to form a valid group, for example, when the riders in SI
do not have neighbors in SU. Hence, we design an access index (see Definition 5.5) for
efficiently detecting such cases (see Theorem 5.7).
Definition 5.5. (Access index) Let idx(v) be the index of user $v \in L_v^m$. The index of u is
set to idx(u) = 0. The access index of a user $v \in u \cup L_v^m$ is
$A(v) = \max\{idx(v') \mid v' \in NB_{u \cup L_v^m}(v)\}$.
Continue with the example of SaRG query in Figure 5.4. Lmv = {v1, v2, v3, v4, v5, v6, v7}
is the sorted user list in which the users are sorted by their travel costs in ascending order.
The access indexes of the users in u ∪ Lmv are given in Table 5.3.
Based on the definition of access index, we have the following theorem.
Theorem 5.7. Let $k(u \cup S_I)$ be the maximum core number of the subgraph $G[u \cup S_I]$. If
$\max\{idx(v) \mid v \in S_I\} \ge \min\{A(v) \mid v \in u \cup S_I\}$ and $k(u \cup S_I) < k$, no valid ridesharing
group can be derived from the current SI and SU.
Proof. The condition $k(u \cup S_I) < k$ means that at least one rider v′ ∈ SU must be added into SI
to increase the core number of $S_I \cup u$ and form a k-core. The condition $\max\{idx(v) \mid v \in S_I\} \ge
\min\{A(v) \mid v \in u \cup S_I\}$ means that some user in $u \cup S_I$ has no neighbor in SU,
so no valid ridesharing group (i.e., k-core) can be formed from the current SI and
SU. This theorem is proved.
Theorem 5.7 states that if u ∪ SI is not a k-core group and there exists a user v ∈ u ∪ SI
that has no neighbor in SU, no valid ridesharing group can be derived from the current SI and
SU. Example 5.3 illustrates a case of Theorem 5.7.
Example 5.3. Consider a social network and a sorted rider list $L_v^m$ = {v1, v2, v3, v4, v5, v6, v7}
with the access indexes shown in Table 5.3. Consider an SaRG query $q_u = (u, k, s, tp_u)$ with
k=3 and s=3, where the current SI = {v1, v2, v3} and SU = {v4, v5, v6, v7}. We have
max{idx(v) | v ∈ {v1, v2, v3}} = idx(v3) = 3 and k(u, v1, v2, v3) = 0. From Table 5.3, we know
min{A(v) | v ∈ {u, v1, v2, v3}} = A(v2) = 1. Because idx(v3) ≥ A(v2) and
k(u, v1, v2, v3) < k, Theorem 5.7 implies that we cannot derive a valid ridesharing group
from the current SI and SU.
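The access index of Definition 5.5 and the test of Theorem 5.7 can be sketched as follows. All function names are hypothetical, and the edge set is an assumption: it was chosen so that the computed access indexes reproduce Table 5.3, but it is not taken from Figure 5.4(a). Under this assumed edge set the maximum core number of G[{u, v1, v2, v3}] evaluates to 1; the sketch relies only on that value being below k.

```python
def build_adj(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def access_indexes(adj, u, sorted_riders):
    """idx(u) = 0, idx(v_i) = i; A(v) is the largest index among v's neighbors."""
    idx = {u: 0}
    for i, v in enumerate(sorted_riders, start=1):
        idx[v] = i
    access = {v: max((idx[w] for w in adj.get(v, ()) if w in idx), default=0)
              for v in idx}
    return idx, access


def max_core_number(adj, members):
    """Largest k such that G[members] contains a non-empty k-core."""
    alive, best = set(members), 0
    while alive:
        v = min(alive, key=lambda x: len(adj.get(x, set()) & alive))
        best = max(best, len(adj.get(v, set()) & alive))
        alive.remove(v)
    return best


def access_index_prunes(adj, u, si, idx, access, k):
    """Theorem 5.7; S_I is assumed to be non-empty."""
    members = set(si) | {u}
    stuck = max(idx[v] for v in si) >= min(access[v] for v in members)
    return stuck and max_core_number(adj, members) < k


# Assumed (hypothetical) edge set and the sorted rider list of the example.
EDGES = [("u", "v4"), ("u", "v6"), ("v4", "v6"), ("v4", "v5"), ("v5", "v6"),
         ("v3", "v5"), ("v1", "v3"), ("v1", "v2"), ("v4", "v7")]
ADJ = build_adj(EDGES)
RIDERS = ["v1", "v2", "v3", "v4", "v5", "v6", "v7"]
IDX, ACCESS = access_indexes(ADJ, "u", RIDERS)

print(ACCESS)
print(access_index_prunes(ADJ, "u", ["v1", "v2", "v3"], IDX, ACCESS, k=3))
```

Since the access indexes depend only on the sorted list and the graph, they can be computed once per query and reused at every node of the search tree.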
5.2.3 Group Diameter Strategy
In this section, we propose another pruning technique based on the concept of k-core
group diameter. For a given size-(s+1) k-core group, we first present the definition of the
group diameter, and then derive the diameter upper bound of a size-(s+1) k-core group.
Based on the diameter upper bound, we introduce our diameter-based pruning method.
Definition 5.6. (Diameter) The diameter of a ridesharing group $G_s^k(u)$ in G is defined as
the longest social distance (i.e., the longest shortest path length) between any two users
in $G[G_s^k(u)]$, denoted by $Dia(G_s^k(u))$.
Let $Dia_{ub}(G_s^k(u))$ denote the upper bound of $Dia(G_s^k(u))$. In this chapter, we make
use of the k-core group diameter upper bound proposed in [55].
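Theorem 5.8. For a valid ridesharing group $G_s^k(u)$,

$$
Dia_{ub}(G_s^k(u)) =
\begin{cases}
1 & \text{if } s = k \\
2 & \text{if } k < s < 2k + 1 \\
3\left\lfloor \frac{s+1}{k+1} \right\rfloor + r(s+1, k) - 3 & \text{if } s \ge 2k + 1
\end{cases}
\tag{5.2.5}
$$

where

$$
r(s+1, k) =
\begin{cases}
0 & \text{if } \mathrm{mod}(s+1, k+1) = 0 \\
1 & \text{if } \mathrm{mod}(s+1, k+1) = 1 \\
2 & \text{if } \mathrm{mod}(s+1, k+1) = 2
\end{cases}
$$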
This diameter upper bound of $G_s^k(u)$ indicates a way to measure whether two users
can co-exist in $G_s^k(u)$. Next we present Lemma 5.1 to prune the search space by using the
diameter upper bound of $G_s^k(u)$.
Lemma 5.1. Consider an SaRG query qu = (u, k, s, tpu). For each v ∈ SU , if the social
distance between v and any user v′ ∈ u ∪ SI is larger than Diaub(Gks(u)), v cannot be
added into SI to form a valid ridesharing group.
Example 5.4. Consider an SaRG query $q_u = (u, k, s, tp_u)$ with k=2 and s=2, the users’
social relations, and the travel costs shown in Figure 5.4. According to Theorem 5.8,
the diameter upper bound for the query $q_u$ is 1. Assume that the current SI
and SU are ∅ and {v1, v2, v3, v4, v5, v6, v7} respectively. As the driver u must be a member
of any valid ridesharing group, the riders in SU whose social distances to u are larger
than 1 should be filtered out from SU. We then get SU = {v4, v6}. Clearly, the diameter-based
pruning method can substantially shrink the search space. We can quickly derive
the optimal ridesharing group $G_s^k(u)$ = {u, v4, v6}.
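The bound of Theorem 5.8 and the filter of Lemma 5.1 can be sketched as below. Several points are assumptions rather than statements from the text: the bracket in Eq. (5.2.5) is read as a floor, remainders larger than 2 are mapped to r = 2 (the theorem lists only the remainders 0, 1 and 2), social distance is measured in hops over the whole graph, and the edge set is a hypothetical one consistent with Example 5.4.

```python
from collections import deque


def diameter_upper_bound(k, s):
    """Eq. (5.2.5) for a size-(s+1) k-core group (floor and r >= 2 are assumed)."""
    if k < 1 or s < k:
        raise ValueError("a size-(s+1) k-core needs 1 <= k <= s")
    if s == k:
        return 1
    if s < 2 * k + 1:
        return 2
    r = min((s + 1) % (k + 1), 2)
    return 3 * ((s + 1) // (k + 1)) + r - 3


def hop_distances(adj, source):
    """Breadth-first search distances from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def filter_candidates(adj, members, su, bound):
    """Lemma 5.1: keep riders within `bound` hops of every current member."""
    keep = set(su)
    for m in members:
        dist = hop_distances(adj, m)
        keep = {v for v in keep if dist.get(v, float("inf")) <= bound}
    return keep


def build_adj(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


# Assumed (hypothetical) edge set for the users u, v1, ..., v7.
EDGES = [("u", "v4"), ("u", "v6"), ("v4", "v6"), ("v4", "v5"), ("v5", "v6"),
         ("v3", "v5"), ("v1", "v3"), ("v1", "v2"), ("v4", "v7")]
ADJ = build_adj(EDGES)
RIDERS = ["v1", "v2", "v3", "v4", "v5", "v6", "v7"]

bound = diameter_upper_bound(k=2, s=2)
print(bound, sorted(filter_candidates(ADJ, {"u"}, RIDERS, bound)))  # 1 ['v4', 'v6']
```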
5.2.4 k-plex Based Strategy
In this section, we present a novel pruning technique based on the concept of k-plex [6].
The advantage of k-plex lies in its property that, if G′ is a k-plex, any induced subgraph of G′
is also a k-plex. In contrast, k-core does not share such a property. Fortunately, a k-core
of a fixed size can be expressed as a k-plex, which lets us exploit this property. To find an
SaRG $G_s^k(u)$, we convert the problem to finding a $\bar{k}$-plex of size s+1, where $\bar{k} = s+1-k$.
Thus, if we can identify a maximum $\bar{k}$-plex in G′ whose size is ≥ s+1, there must exist a k-core
with size s+1. Otherwise, G′ does not contain a k-core with size s+1 (i.e., no valid ridesharing
group can be found from G′).
Definition 5.7. (k-plex) Given a graph G = (V, E), a k-plex is a subgraph SG = (SV, SE)
(SV ⊆ V, SE ⊆ E) in which each vertex v ∈ SV has degree at least |SG| − k.
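A short sketch of Definition 5.7 and of the conversion used in this section: a group of size s+1 has minimum degree at least k exactly when it is an (s+1−k)-plex. The function names and the edge set are assumptions made for illustration, and "k-core" is read as the minimum-degree condition on the induced subgraph.

```python
def build_adj(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def is_k_plex(adj, vertices, k_bar):
    """Every vertex has at least |S| - k_bar neighbors inside S."""
    vs = set(vertices)
    return all(len(adj.get(v, set()) & vs) >= len(vs) - k_bar for v in vs)


def is_valid_group(adj, members, k):
    """Size-(s+1) k-core test expressed as an (s+1-k)-plex test."""
    members = set(members)
    return is_k_plex(adj, members, len(members) - k)


# Assumed (hypothetical) edge set for the users u, v1, ..., v7.
EDGES = [("u", "v4"), ("u", "v6"), ("v4", "v6"), ("v4", "v5"), ("v5", "v6"),
         ("v3", "v5"), ("v1", "v3"), ("v1", "v2"), ("v4", "v7")]
ADJ = build_adj(EDGES)

print(is_valid_group(ADJ, {"u", "v4", "v6"}, k=2))  # True: a triangle
print(is_valid_group(ADJ, {"u", "v4", "v5"}, k=2))  # False: u has one neighbor
print(is_k_plex(ADJ, {"u", "v4"}, 1))               # True: subsets stay 1-plexes
```

The last call illustrates the hereditary property that motivates the strategy: a partial group that already fails the plex test can never be completed into a valid one.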
To estimate the maximum size of a $\bar{k}$-plex, we adopt the approach presented in [47]
and calculate the size upper bound $B_p(G)$ of a maximum $\bar{k}$-plex in a graph G as follows

$$B_p(G) = \min_{i=1,\cdots,p} \left\{ \frac{1}{i} B(C_1^i, \cdots, C_m^i) \right\}, \tag{5.2.6}$$

and

$$B(C_1^i, \cdots, C_m^i) = \sum_{j=1}^{m_i} \min\{2\bar{k} - 2 + \bar{k} \bmod 2,\ \bar{k} + a_{i,j},\ \Delta(G[C_j^i]) + \bar{k},\ |C_j^i|\} \tag{5.2.7}$$

where $\bar{k} = s+1-k$, $C_1^i, \cdots, C_m^i$ are co-$\bar{k}$-plexes [18] in which each vertex of V appears
exactly i times, $a_{i,j} = \max\{m : |\{v \in V \wedge \deg_G(v) \ge m\}| \ge \bar{k} + m\}$ for each $C_j^i$, p is an
integer parameter to limit the iterations of computing, and $\deg_G(v)$ represents the vertex
v’s degree in G.
Lemma 5.2. Given an SaRG query $q_u = (u, k, s, tp_u)$, if the size upper bound of a $\bar{k}$-plex
$B_p(u \cup S_I \cup S_U)$ (with $\bar{k} = (s+1)-k$) is less than s+1, no valid ridesharing group can be derived
from the current SI and SU.
The size upper bound of the maximum k-plex is effective in pruning the search space.
Example 5.5 illustrates a case of Lemma 5.2.
Example 5.5. Consider the users shown in Figure 5.4(a). Assume that the current SI={v1,
v2, v3} and SU={v4,v5,v6,v7}. Given an SaRG query qu={u,k,s,tpu} with k=4 and s=4,
we can calculate the size upper bound of a ((4+1)-4)-plex, which is 4. Since the requested
group size is 4+1=5 > 4, no valid ridesharing group can be derived from the current SI
and SU.
5.2.5 Query Processing
In Algorithm 14, we integrate the three aforementioned types of pruning strategies into
GetOptimalGroup. We call this integrated algorithm GetOptimalGroupStar. The differences
between GetOptimalGroupStar and GetOptimalGroup are in Lines 4–14 of Algorithm 14.
If there is not enough quota in SI to form a k-core, the search process on
the current SI and SU will stop and backtrack to the last stage of SI (Lines 4–6). Otherwise,
we will continue to verify whether $\max\{idx(v) \mid v \in S_I\} \ge \min\{A(v) \mid v \in u \cup S_I\}$
and $k(u \cup S_I) < k$. If yes, the search process on the current SI and SU will be stopped
(Lines 7–8). The correctness is guaranteed by Theorem 5.6 and Theorem 5.7. Given the
current SI ∪ u, the users in SU whose social distance to some user in u ∪ SI is larger than
the group diameter upper bound $Dia_{ub}(G_s^k(u))$ will be filtered out from SU. This
pruning strategy (justified by Lemma 5.1) can substantially shrink the search space and
reduce the time cost (Lines 9–11). Afterwards, according to Lemma 5.2, we calculate
the group size upper bound $B_p(u \cup S_I \cup S_U)$ of an ((s+1)-k)-plex in $G[u \cup S_I \cup S_U]$. If
$B_p(u \cup S_I \cup S_U) < s+1$, no valid $G_s^k(u)$ can be found from the current SI and SU (Lines 12–14).

Algorithm 14 GetOptimalGroupStar(Driver u, Integer k, Integer s, Trip tp_u, RiderSet S_I, RiderSet S_U, SaRG G_s^k(u), SocialNetwork G)
 1: while |S_I| + |S_U| ≥ s do
 2:     if D(tp_u, G_s^k(u)) ≤ D_lb(tp_u, S_I) then
 3:         Break;
 4:     Select the user v′ with minimum |NB_{S_I ∪ u}(v′)| from S_I ∪ u;
 5:     if min{|NB_{u ∪ S_I}(v)| | v ∈ u ∪ S_I} + s − |S_I| < k then
 6:         Break;
 7:     if max{idx(v) | v ∈ S_I} ≥ min{A(v) | v ∈ u ∪ S_I} and k(u ∪ S_I) < k then
 8:         Break;
 9:     for each rider v ∈ S_U do
10:         if all social distances from v to S_I ≤ Dia_ub(G_s^k(u)) then
11:             Add v into S_U′;
12:     Compute the group size upper bound B_p(G[u ∪ S_I ∪ S_U]) of an (s+1-k)-plex in G[u ∪ S_I ∪ S_U];
13:     if B_p(G[u ∪ S_I ∪ S_U]) < s+1 then
14:         Break;
15:     Select the rider v with minimum travel cost from S_U′;
16:     S_I′ ← S_I ∪ {v}, S_U′ ← S_U′ − {v};
17:     if |u ∪ S_I′| = s+1 then
18:         if G[u ∪ S_I′] is a k-core then
19:             Return u ∪ S_I′;
20:     else
21:         Compute the maximum k-core S of the subgraph G[u ∪ S_I′ ∪ S_U];
22:         if u ∪ S_I′ ⊈ S then
23:             Break;
24:         else
25:             S_U ← S − S_I′ − u;
26:             G_s^k(u)′ ← GetOptimalGroupStar(u, k, s, tp_u, S_I′, S_U′, G_s^k(u), G);
27:             if D(tp_u, G_s^k(u)) > D(tp_u, G_s^k(u)′) then
28:                 G_s^k(u) ← G_s^k(u)′;
29: Return G_s^k(u);
5.3 Incremental Strategies
For the GetOptimalGroupStar algorithm, there is still room to further improve its performance.
First, in each iteration, when a rider is added into or removed from SI and SU, the
core decomposition algorithm is invoked to recompute $k_{u \cup S_I' \cup S_U}(v)$ where $v \in u \cup S_I' \cup S_U$
(Line 21, Algorithm 14), which is a time-consuming operation. In fact, as illustrated later,
there is no need to recompute all the $k_{u \cup S_I' \cup S_U}(v)$. Second, the conditions that some
riders cannot co-exist in a valid ridesharing group, which were checked in the previous
iterations, may still hold in the subsequent iterations. By properly reusing the previous
useful information, it is possible to avoid many repeated computations. Thus, in this section,
we design several incremental strategies (i.e., incremental computation of core numbers,
social diameter-based bounding and neighbor-based bounding) to further reduce the
running time.
5.3.1 Incremental Computation of Core Numbers
In GetOptimalGroupStar, each time a rider is added into or removed from SI and SU,
the core decomposition algorithm is invoked over the current SI and SU. It means that we
need to recompute the core numbers of all riders in the current SI and SU. Such operations
are conducted frequently during the search process, which increases the running time.
Example 5.6 illustrates such a case.
Example 5.6. Consider the users in Figure 5.4(a). Let G[V ] be a subgraph of G, which
has been decomposed in the previous iteration, where V ={u,v1,v2,v3,v4,v5,v6}. The core
numbers of these vertices are: kV (u)=2, kV (v1)=1, kV (v2)=1, kV (v3)=1, kV (v4)=2,
kV (v5)=2, kV (v6)=2. Here, kV (v) denotes the core number of v in G[V ]. When user
v7 is added into V , the core numbers of u, v1,v2,v3,v4,v5,v6 do not change. Only the newly
added user v7’s core number needs to be computed, which is 1.
Here, we make use of Theorem 5.9 from [53] to shrink the vertex space by indicating
which vertices’ core numbers may change. We also adopt the Traversal Algorithm in [53]
for the incremental core decomposition when a user is added into or removed from the
current search space.
Theorem 5.9. Given a graph G=(V,E), if an edge (u, v) is inserted (removed) and
kV (u)≤kV (v), then only the vertex w∈V , which has kV (w)=kV (u) and is reachable from
u via a path consisting of the vertices with core number equal to kV (u), may have its core
number incremented (decremented).
Example 5.7. Continue with Example 5.6. Before the user v7 is added, the core numbers
of the vertices in V = {u, v1, v2, v3, v4, v5, v6, v7} are: kV(u)=2, kV(v1)=1, kV(v2)=1,
kV(v3)=1, kV(v4)=2, kV(v5)=2, kV(v6)=2, kV(v7)=0. After v7 is added, since the
edge (v4, v7) is inserted and kV(v7)=0 ≤ kV(v4), according to Theorem 5.9, only the
vertices whose core number is 0 and which are reachable from v7 via a path consisting
of vertices with core number 0 may have their core number changed. Since no other user in V
satisfies this condition, only kV(v7) needs to increase by 1, leading to kV(v7)=1. The time
cost is reduced by avoiding recomputing kV(u) and kV(vi) (1≤i≤6).
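To see what the incremental update saves, the sketch below runs a full core decomposition before and after v7 joins and checks that only v7's entry differs. This is the naive recomputation, not the Traversal Algorithm of [53]; the function names are hypothetical and the edge set is an assumption chosen to reproduce the core numbers quoted in Examples 5.6 and 5.7.

```python
def build_adj(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def core_numbers(adj, vertices):
    """Core number of each vertex in the subgraph induced by `vertices`.

    Repeatedly removes a minimum-degree vertex; a vertex's core number is the
    largest minimum degree observed up to the moment it is removed.
    """
    alive = set(vertices)
    core, level = {}, 0
    while alive:
        v = min(alive, key=lambda x: len(adj.get(x, set()) & alive))
        level = max(level, len(adj.get(v, set()) & alive))
        core[v] = level
        alive.remove(v)
    return core


# Assumed (hypothetical) edge set for the users u, v1, ..., v7.
EDGES = [("u", "v4"), ("u", "v6"), ("v4", "v6"), ("v4", "v5"), ("v5", "v6"),
         ("v3", "v5"), ("v1", "v3"), ("v1", "v2"), ("v4", "v7")]
ADJ = build_adj(EDGES)

V = ["u", "v1", "v2", "v3", "v4", "v5", "v6"]
before = core_numbers(ADJ, V)
after = core_numbers(ADJ, V + ["v7"])
changed = {v for v in after if after[v] != before.get(v)}
print(before)
print(changed)  # {'v7'}
```

The full decomposition touches every vertex on each call, whereas the example shows that a single entry changes, which is the gap the incremental strategy exploits.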
5.3.2 Social Diameter-based Bounding
The diameter-based pruning technique presented in Section 5.2.3 indicates that some users
cannot co-exist in a ridesharing group due to the group diameter upper bound. Suppose
users v′ and v′′ cannot co-exist in any Gks(u) found from the current search space S. If
several users are added into S to form a new search space S ′, any user group from S ′
containing v′ and v′′ still cannot satisfy the group diameter upper bound. Therefore, if
such combinations of users calculated in the previous iterations can be cached, we can
prune out all the user groups containing such combinations directly. When the cache size
is fixed, an SI of smaller size has a higher priority to be cached. This is because a
smaller set usually appears at a higher level of the branch and bound search tree
(see Figure 5.3), so caching it prunes more combinations that cannot co-exist in a valid
ridesharing group.
Example 5.8. Continue with the example in Figure 5.4(a). Given an SaRG query $q_u$
= (u, k, s, tp_u) with k=2 and s=2, the diameter upper bound of a
valid ridesharing group $G_s^k(u)$ is 1, based on Theorem 5.8. Thus, any user set whose social
diameter is more than 1 cannot be the final result. Before adding v7 into the search space,
we already know that v2 and v3 cannot co-exist in a valid group. Hence, we cache the
combination of v2 and v3. In the next iteration, when v7 is added into the search space, any
user group containing v2 and v3 can be directly removed from the solution space.
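The caching idea can be sketched with a set of unordered pairs. This is an illustrative simplification: the class and method names are hypothetical, only pairs are cached (the text also allows larger combinations), and no fixed cache size or priority rule is modeled.

```python
class IncompatiblePairCache:
    """Remembers pairs of users that violate the diameter upper bound."""

    def __init__(self):
        self._pairs = set()

    def add(self, a, b):
        self._pairs.add(frozenset((a, b)))

    def blocks(self, group):
        """True if the group contains any cached incompatible pair."""
        members = list(group)
        return any(frozenset((a, b)) in self._pairs
                   for i, a in enumerate(members)
                   for b in members[i + 1:])


# Example 5.8: v2 and v3 were found incompatible before v7 joined the search space.
cache = IncompatiblePairCache()
cache.add("v2", "v3")
print(cache.blocks({"v2", "v3", "v7"}))  # True: pruned without recomputation
print(cache.blocks({"v2", "v7"}))        # False: must still be examined
```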
5.3.3 Neighbor-based Bounding
Consider an intermediate rider set SI and a remaining rider set SU for an SaRG query
$q_u = (u, k, s, tp_u)$ with k=2 and s=3. Assume the current G[u∪SI] is a size-3 2-core. To
form a valid ridesharing group of size-(3+1), one more rider v ∈ SU should be selected
and added into SI. According to the previous strategy, v should be the rider in SU with the
minimum travel cost. However, if v is not a neighbor of any member in u∪SI, adding v
into SI may not help form a valid 2-core ridesharing group. If the riders added next are
similar to v in this respect, the time cost would increase.
To reduce such non-beneficial adding operations, one possible way is to quickly find
a travel cost lower bound of the optimal ridesharing group derived from current u ∪ SI
and its members’ social neighbors in SU to prune these non-beneficial operations. Here,
we design a greedy algorithm GreedyRSGSearch (Algorithm 15) to find such a travel cost
lower bound. The general idea is to greedily retrieve an ((s+1)-k)-plex containing u∪ SI
with size s+1. Then, the valid ridesharing group, which is composed of the users of the
found ((s+1)-k)-plex, is updated as the current optimal answer to prune the search space
in future iterations when its travel cost is the current lowest.
Algorithm 15 shows the pseudocode of GreedyRSGSearch. We first initialize rider
v as the rider in SU with the maximum travel cost, and the neighbor set NBs of users in
SI as ∅ (Lines 1–2). Thereafter, we continue to add a rider v′ into SI until the size of
u∪SI reaches s+1. To select the rider v′, we first add all the neighbors of the users in u∪SI
that belong to SU into NBs (Line 4), and then select the rider v′ ∈ NBs with the minimum
travel cost that makes G[v′∪u∪SI] an ((s+1)-k)-plex (Lines 5–8). If such a rider v′ exists,
we replace v by v′ and add it into SI. We repeat the same process until the size of u∪SI is
≥ s+1. Otherwise, we return the empty set. If we find an ((s+1)-k)-plex with size-(s+1),
the current ridesharing group u∪SI is returned, and the travel cost of the group u∪SI
provides the travel cost lower bound to prune the search space in the subsequent iterations.
During the search process of GetOptimalGroupStar, GreedyRSGSearch is invoked
to find a valid ridesharing group with a tight travel cost bound. If the returned group is
empty, which means that no tight travel cost bound can be found, GetOptimalGroupStar
continues using the previously found travel cost lower bound to prune the search space.
Note that GreedyRSGSearch needs to be invoked only if the current G[u∪SI] is an
((s+1)-k)-plex. The reason is that u∪SI has a better chance of forming a valid ridesharing
group when it is an ((s+1)-k)-plex. Otherwise, the time cost would increase considerably.

Algorithm 15 GreedyRSGSearch(Driver u, Integer k, Integer s, Trip tp_u, RiderSet S_I, RiderSet S_U, SocialNetwork G)
 1: Rider v ← the rider in S_U with maximum travel cost;
 2: RiderSet NBs ← ∅;
 3: while |u ∪ S_I| < s+1 do
 4:     NBs ← the neighbors of the users in u ∪ S_I that belong to S_U;
 5:     for each user v′ ∈ NBs do
 6:         if G[v′ ∪ u ∪ S_I] is a ((s+1)-k)-plex then
 7:             if D(tp_v′, tp_u) ≤ D(tp_v, tp_u) then
 8:                 v ← v′;
 9:     if v is not the previous rider then
10:         Add v into S_I;
11:     else
12:         Return ∅
13: Return S_I;
Example 5.9. Consider the users’ social relations and their travel costs in Figure 5.4. Assume
the current SI = ∅ and SU = {v1, v2, v3, v4, v5, v6, v7}. Given an SaRG query $q_u = (u, k, s, tp_u)$
with k=2 and s=2, we call GreedyRSGSearch to search for a travel cost lower bound from
u ∪ SI = {u} and G in Figure 5.4. We can get an intermediate answer $G_s^k(u)$ = {u, v4, v6}
and calculate the group travel cost lower bound $D(tp_u, \{u, v_4, v_6\}) = D(tp_{v_4}, tp_u) + D(tp_{v_6}, tp_u)$
= 4 + 4.5 = 8.5. Therefore, there is no need to attempt the groups whose travel cost is ≥
8.5. For example, when u ∪ SI = {u, v3}, there is no need to move v5, v6, v7 from SU to
SI; when u ∪ SI = {u, v4}, there is no need to move v5, v6, v7 from SU to SI.
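The greedy search can be sketched in a few lines. This is one reading of Algorithm 15, not a line-by-line port: at each step it picks the cheapest neighboring rider that keeps u ∪ SI an ((s+1)-k)-plex, and it returns None where the pseudocode returns ∅. The function names are hypothetical; the edge set and the rider-to-cost mapping are assumptions consistent with Example 5.9.

```python
def build_adj(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def is_k_plex(adj, vertices, k_bar):
    vs = set(vertices)
    return all(len(adj.get(v, set()) & vs) >= len(vs) - k_bar for v in vs)


def greedy_rsg_search(adj, u, costs, k, s, si, su):
    """Greedily grow S_I into a size-(s+1) group; returns (group, cost) or None."""
    si, su = list(si), set(su)
    k_bar = s + 1 - k
    while len(si) < s:
        members = set(si) | {u}
        neighbors = {w for v in members for w in adj.get(v, ())} & su
        feasible = [w for w in neighbors if is_k_plex(adj, members | {w}, k_bar)]
        if not feasible:
            return None  # no rider keeps the plex property
        pick = min(feasible, key=lambda w: (costs[w], w))  # name breaks ties
        si.append(pick)
        su.discard(pick)
    return set(si) | {u}, sum(costs[v] for v in si)


# Assumed (hypothetical) edge set and travel costs.
EDGES = [("u", "v4"), ("u", "v6"), ("v4", "v6"), ("v4", "v5"), ("v5", "v6"),
         ("v3", "v5"), ("v1", "v3"), ("v1", "v2"), ("v4", "v7")]
COSTS = {"v1": 2, "v2": 3.5, "v3": 4, "v4": 4, "v5": 4.5, "v6": 4.5, "v7": 7}
ADJ = build_adj(EDGES)

result = greedy_rsg_search(ADJ, "u", COSTS, k=2, s=2, si=[], su=COSTS)
print(result)
```

A size-(s+1) group that is an ((s+1)-k)-plex has minimum degree at least k, so any group returned here is valid and its cost can serve as the pruning threshold described in Example 5.9.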
5.4 Hybrid Index
Indexing is a commonly used technique to optimize query performance. Recently, several
approaches have been developed for geo-social group queries by considering the users’
Euclidean distances and their social relations, e.g., SR-tree [68] and SaR-tree [39]. However,
these indexes are not directly applicable to our problem. In this section, we first
propose an R-tree based index, namely Social-Info R-tree, which incorporates the social
information, and then integrate the proposed index into RSGExplorer.
5.4.1 SIR-tree
For efficiency, we propose a hybrid indexing structure, the Social-Info R-tree
(SIR-tree), to support the simultaneous computation of the spatial distance and the
social constraint. It is a tree-based structure that is able to prune the search space using
the maximum core bound and the ridesharing group diameter bound. Each internal tree
node e stores the following social information: (i) the maximum core number cmax(e) of
the child nodes rooted at this node; (ii) the set nb[e|x] containing the users whose social
distances to all users rooted at e are less than or equal to x. Since $nb[e|x] = \bigcup_{e' \in e} nb[e'|x]$,
where e′ is a child node of e, the SIR-tree can be built in a bottom-up fashion. Figure 5.5
shows an example of SIR-tree. In RSGExplorer, an SIR-tree is used to find the next rider
with the minimum travel cost. Its advantages are at least two-fold:
• By cmax(e), it can prune the users who cannot appear in the final k-core result set
as early as possible;
• By nb[e|x], it can prune the users whose social distances to the query issuer are
larger than Diaub(Gks(u)) as early as possible.
Based on the SIR-tree proposed above, Theorem 5.10 is given below to assist in pruning
the search space during query processing.
Theorem 5.10. Consider an SaRG query $q_u = (u, k, s, tp_u)$ and an internal node e in the SIR-tree.
If cmax(e) < k or $u \notin nb[e|Dia_{ub}(G_s^k(u))]$, then any user rooted at node e cannot
be a member of the final optimal answer.
[Figure 5.5: An example of SIR-tree. The tree has root R7, intermediate nodes R5 and R6, and nodes R1, R2, R3, R4 above the leaf entries v1, v2, v3, v5, v4, v6, v7, u. Each node Ri carries a record SocialInfo i. The two records shown in full are SocialInfo 1: cmax(R1)=1, nb[R1|1]={v1,v2,v3}; and SocialInfo 2: cmax(R2)=2, nb[R2|1]={v1,v3,v4,v5,v6}.]
Proof. It can be easily derived from Definition 5.1 and Lemma 5.1. We omit it here
because of the space limitation.
Example 5.10. Consider an SaRG query $q_u = (u, k, s, tp_u)$ with k=2 and s=2, the social
network shown in Figure 5.4(a), and the social information of the nodes R1 and R2 in
Figure 5.5. According to Theorem 5.8, we can calculate $Dia_{ub}(G_s^k(u))$ = 1. During the
search process, R1 can be pruned due to the fact that cmax(R1) < k; R2 can be pruned
because $u \notin nb[R2|Dia_{ub}(G_s^k(u))]$.
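The social information kept in an SIR-tree node and the test of Theorem 5.10 can be sketched with a small record type. The class, field and function names are hypothetical, the spatial part of the R-tree is omitted, and the parent record follows the bottom-up rule quoted in Section 5.4.1; the node contents are the two SocialInfo records given in Figure 5.5.

```python
from dataclasses import dataclass, field


@dataclass
class SIRNode:
    name: str
    cmax: int                 # maximum core number below this node
    nb: dict                  # x -> users within social distance x
    children: list = field(default_factory=list)


def merge(name, children):
    """Build a parent record bottom-up: max of cmax, union of nb sets."""
    xs = set().union(*(c.nb.keys() for c in children))
    nb = {x: set().union(*(c.nb.get(x, set()) for c in children)) for x in xs}
    return SIRNode(name, max(c.cmax for c in children), nb, list(children))


def can_prune(node, u, k, dia_ub):
    """Theorem 5.10: no user below `node` can be in the optimal answer."""
    return node.cmax < k or u not in node.nb.get(dia_ub, set())


# Social information of R1 and R2 as listed in Figure 5.5.
R1 = SIRNode("R1", 1, {1: {"v1", "v2", "v3"}})
R2 = SIRNode("R2", 2, {1: {"v1", "v3", "v4", "v5", "v6"}})
R5 = merge("R5", [R1, R2])

# Example 5.10: k = 2, s = 2, diameter upper bound 1.
print(can_prune(R1, "u", k=2, dia_ub=1))  # True, because cmax(R1) < k
print(can_prune(R2, "u", k=2, dia_ub=1))  # True, because u is not in nb[R2|1]
```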
5.4.2 Query Processing
To process SaRG queries with an SIR-tree, we need to reconcile the method introduced
in Section 5.3 with a major modification of how to find the next rider with the lowest
travel cost. Since there is no social information recorded in the R-tree, the rider we get is
only spatially close to $tp_u$. However, if a user’s core number is less than the query social
constraint k, or the social distance between a user and the query issuer is larger than the
valid group diameter upper bound, the user should be pruned from the search space as
early as possible. Otherwise, it will increase the computational cost in the subsequent
branch and bound search. With the help of the SIR-tree, we can prune in advance the tree nodes
whose users cannot appear in the final result, and thereby shrink the search space
of the branch and bound search for the optimal answer. Thus, the SIR-tree based algorithm
achieves better query performance than the algorithms proposed in the previous sections.
Table 5.4: Dataset properties

                              Brightkite(America)  Brightkite(Europe)  Gowalla(America)  Gowalla(Europe)
Total # of users              12,363               4,385               18,983            26,912
Total # of friend relations   115,506              12,271              115,506           157,006
Total # of trips              12,363               4,385               18,983            26,912
Diameter (social diameter)    10                   11                  13                14
Maximum # of cores            34                   25                  36                39
5.5 Performance Evaluation
In this section, we experimentally evaluate the performance of three algorithms: The first
one is the basic RSGExplorer with three pruning strategies (referred to as Baseline) pre-
sented in Section 5.2; the second one is Baseline with the incremental methods (referred
to as Incremental) presented in Section 5.3; the last one is Incremental based on SIR-tree
(referred to as SIRBased) presented in Section 5.4.
5.5.1 Experimental Settings
We make use of four datasets extracted from Brightkite and Gowalla [17]: Brightkite
(America), Brightkite (Europe), Gowalla (America), and Gowalla (Europe). The properties
of the four datasets are summarized in Table 5.4.
Each query set on these four datasets includes 100 queries. Each query contains a
query issuer u randomly generated from the corresponding user space, a group size s
varying from 4 to 7, a social constraint k from 1 to 4, and a query trip $tp_u$ randomly
selected from the users’ trips. Unless explicitly specified, the default values of k and s in
a query are 3 and 5, respectively.
All the algorithms are implemented in Java. The experiments run on an Intel Xeon
X5650 processor (2.67 GHz) with 8 GB of DDR3 memory.
The fanouts of the R-tree and the SIR-tree are 100.
5.5.2 Experimental Results
We evaluate the query processing performance of these three algorithms under different
parameter settings. Following many other query processing performance evaluation
methods, we report the overall query performance in terms of the average elapsed time.
[Figure 5.6: Running time vs. group size. Panels: (a) Brightkite(America), (b) Brightkite(Europe), (c) Gowalla(America), (d) Gowalla(Europe). x-axis: s (k=3), from 4 to 7; y-axis: running time (ms), log scale; series: Baseline, Incremental, SIRBased.]
Effect of s. In the first set of experiments, we evaluate the query performance under
different s values. From Figure 5.6, we can observe that both Incremental and SIRBased
perform better than Baseline. Note that the y-axis is in log-scale. Under different values of
s, SIRBased achieves the best performance. This conforms to our theoretical analysis: the
SIR-tree structure can efficiently prune many irrelevant users who cannot satisfy either the
social diameter or core number constraints as early as possible, leading to a much smaller
search space. Even when s is small, SIRBased algorithm performs the best because a
small group size leads to a small diameter which results in a good pruning ability of the
SIR-tree.
Effect of k. The parameter k is used by the query issuer to flexibly define the social constraint.
In Figure 5.7, we examine the query performance by varying the social parameter
k. A larger k means that the returned group has a tighter cohesiveness. That is, each member
should be familiar with more other members. We can observe that a larger k results
in better performance because it implies a smaller social diameter, which in turn allows
more users to be pruned from the search space. Compared to Baseline and Incremental,
SIRBased achieves consistently better query performance for different k values.
[Figure 5.7: Running time vs. k value. Panels: (a) Brightkite(America), (b) Brightkite(Europe), (c) Gowalla(America), (d) Gowalla(Europe). x-axis: k (s=5), from 1 to 4; y-axis: running time (ms), log scale; series: Baseline, Incremental, SIRBased.]

Effect of the number of riders. In this set of experiments, we show the performance of
the algorithms under various numbers of riders (i.e., the size of the rider space) in Figure 5.8.
We randomly extract several subsets of the rider space to evaluate the algorithms’
performance. As expected, the result demonstrates that SIRBased achieves the best query
efficiency in all cases. Compared to Incremental and SIRBased, Baseline is more sensitive
to the number of riders. Its query processing time increases rapidly with the increase of
the number of riders.
[Figure 5.8: Running time vs. the number of riders. Panels: (a) Brightkite(America), x-axis: # of riders (k=3, s=5), from 2000 to 10000; (b) Gowalla(Europe), x-axis: # of riders (k=3, s=5), from 4000 to 20000. y-axis: running time (ms); series: Baseline, Incremental, SIRBased.]
[Figure 5.9: Pruning abilities of different schemes. Panels: (a) Brightkite (America), x-axis: k (s=5);
(b) Gowalla (Europe), x-axis: s (k=3). Y-axis: running time (ms); series: IC, IC+DB, IC+DB+NB,
IC+DB+NB+SIR.]
[Figure 5.10: Travel cost vs. k or s. Two panels: travel cost vs. k (s=5) and travel cost vs. s (k=3);
series: Brightkite (America), Gowalla (Europe).]
Pruning capabilities of different strategies. In Figure 5.9, we show the query performance
of different pruning strategies. Here, we report the different strategies used in
Incremental and SIRBased, where IC, DB, NB and SIR stand for incremental computation
of core number, social diameter-based bounding, neighbor-based bounding and SIR-tree
based pruning, respectively. In general, all the strategies help to reduce the running time.
In particular, NB and SIR are more effective than the others when the k value is small or
when the s value is large. As explained in Section 5.3.3, neighbor-based bounding usually
helps to find a relatively tight lower bound on the group travel cost as early as possible,
which is beneficial for pruning the search space in later iterations.
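To illustrate why a tight lower bound found early prunes more of the search space, the following is a minimal Python sketch of bound-based pruning. It is not the neighbor-based bounding of Section 5.3.3: the function name `best_group`, the assumption that the group travel cost is the sum of per-rider costs, and the reading of the social constraint as "at least k acquaintances inside the group" are all hypothetical choices made only for this example.

```python
from typing import Dict, List, Optional, Set, Tuple


def best_group(costs: Dict[str, float],
               friends: Dict[str, Set[str]],
               s: int, k: int) -> Tuple[Optional[List[str]], float]:
    """Find an s-rider group of minimum total cost (assumed cost model)
    in which every member has at least k acquaintances inside the group."""
    order = sorted(costs, key=costs.get)  # cheapest riders first
    best_cost = float("inf")
    best: Optional[List[str]] = None

    def feasible(group: List[str]) -> bool:
        members = set(group)
        return all(len(friends.get(r, set()) & members) >= k for r in group)

    def search(start: int, group: List[str], cost: float) -> None:
        nonlocal best_cost, best
        if len(group) == s:
            if cost < best_cost and feasible(group):
                best_cost, best = cost, list(group)
            return
        need = s - len(group)
        for i in range(start, len(order) - need + 1):
            # Lower bound: cost so far plus the `need` cheapest riders
            # available from position i onwards.
            bound = cost + sum(costs[r] for r in order[i:i + need])
            if bound >= best_cost:
                break  # riders are sorted, so later positions bound even higher
            group.append(order[i])
            search(i + 1, group, cost + costs[order[i]])
            group.pop()

    search(0, [], 0.0)
    return best, best_cost
```

Once a feasible group is found, every branch whose bound is not below its cost is skipped without enumeration, so the earlier a good group is discovered, the more branches are cut.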
Travel costs of returned groups. Finally, we report the average group travel costs
of the query results in Figure 5.10. When k or s increases, the travel cost also increases.
For a larger k value, it is more difficult to form a group that has tight social
relations while remaining close to the query issuer, so the travel cost increases accordingly.
On the other hand, when the group size s increases, more users are included in the returned
group, which raises the average travel cost as per the definition of travel cost. An
interesting observation is that when the rider space is larger, the travel cost of the returned
group is smaller. This is because a larger rider space contains more candidate
riders near the query issuer, giving more opportunities to form a ridesharing group with a
smaller travel cost.
5.6 Summary
In this chapter, we have introduced a new and practical type of query, the SaRG query,
which addresses the ridesharing problem with flexible social constraints. An SaRG query
aims to find a group of riders in which each rider’s ridesharing route is close to the
query issuer and each rider is familiar with k other members of the group. We proposed
several efficient algorithms to process SaRG queries. An extensive empirical study on real
datasets demonstrates that the proposed algorithms achieve desirable query performance.
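The social constraint summarized above can be illustrated with the standard peeling procedure for computing a k-core. The sketch below is a minimal Python illustration, not the thesis’s incremental core-number computation; the function name `k_core`, the adjacency-set representation, and the reading of the constraint as "at least k acquaintances" are assumptions made for this example.

```python
from collections import deque
from typing import Dict, Set


def k_core(friends: Dict[str, Set[str]], k: int) -> Set[str]:
    """Return the riders that remain after repeatedly removing anyone
    with fewer than k acquaintances among the remaining riders.
    `friends` is assumed to be a symmetric adjacency map."""
    alive = set(friends)
    degree = {r: len(friends[r] & alive) for r in alive}
    queue = deque(r for r in alive if degree[r] < k)
    while queue:
        r = queue.popleft()
        if r not in alive:
            continue
        alive.discard(r)
        for n in friends[r]:
            if n in alive:
                degree[n] -= 1
                # A neighbor that just dropped below k must be removed too.
                if degree[n] == k - 1:
                    queue.append(n)
    return alive
```

Riders outside the k-core can never belong to a group that satisfies the constraint, so removing them up front shrinks the candidate set before any travel costs are computed.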
Chapter 6
Conclusions and Future Work
6.1 Conclusions
In this thesis, we have identified several real-life group queries motivated by the emergence
of geo-social data in location-based social networks. The contributions of this thesis
are summarized as follows:
• First, we proposed a new type of query, the SIG query, which finds a k-size maximum
interest group in location-based social networks, and proved that the SIG query problem
is NP-complete. Two efficient algorithms, IOAIR and DOAIR, based on the IR-tree
have been developed for processing SIG queries. Extensive empirical
evaluation on real datasets validated the efficiency of the proposed
query processing algorithms.
• Second, we formulated another type of query, the GSKCG query, which is of practical
use in many real-life applications. We formally proved that this problem is
NP-complete. We proposed the algorithm KCGFinder to answer GSKCG queries
and improved its performance by exploring a set of effective pruning techniques
from different perspectives. We designed a novel index structure, the Enhanced
Social-aware R-tree (SaR-tree), to provide extra pruning capabilities on top of the
pruning techniques developed for KCGFinder. We also developed the algorithm
SaRBasedKCGFinder, which integrates KCGFinder and the Enhanced SaR-tree
structure. Extensive experiments on real-life datasets demonstrated that our
proposed algorithm performs well under a wide range of parameter settings.
• Finally, we proposed a new type of query, the SaRG query, to accommodate the real-life
need of considering social comfort and trust in ridesharing. We proved that the SaRG
query problem is NP-hard. We proposed an efficient algorithm named RSGExplorer
and a set of efficient pruning techniques to answer SaRG queries. We also
devised several incremental strategies that reduce repeated computations to further
speed up query processing. We designed a novel index structure, the Social-Info R-tree
(SIR-tree), to further prune the search space and then proposed the SIRBased
algorithm to integrate the RSGExplorer algorithm and the SIR-tree structure.
Experimental results showed that our proposed algorithms achieve desirable performance.
6.2 Future Work
With the research findings obtained above, we plan to further extend our studies so as
to enrich the group query processing techniques. Below we list some open questions for
potential future research:
• First, we plan to extend spatial-aware interest group (SIG) queries to a top-k SIG
query that finds the best k user groups in a single query. So far, we have not
considered the social relationships among users; we will incorporate
social relationships as an important criterion in group formation and develop novel
query processing techniques.
• Second, we plan to work on the following extensions for geo-social k-cover group
queries. The social graph used in Chapter 4 is unweighted; we intend to extend our
algorithm to support a weighted social graph. In addition, an exact solution may not
be needed in some cases, so designing an efficient approximation algorithm with a tight
approximation bound is also part of our future work.
• Third, we plan to further investigate social-aware ridesharing group queries. We
will attempt to design a general framework of social-aware ridesharing that
accommodates various mainstream trip matching and social acquaintance options. We also
plan to integrate our proposed techniques into a real ridesharing system to evaluate
the practical effectiveness of our proposed SaRG query solutions.
100
Bibliography
[1] N. Agatz, A. Erera, M. Savelsbergh, and W. Wang. Sustainable passenger trans-
portation: Dynamic ridesharing. Erasmus Research Instution of Management, 2009.
[2] N. Agatz, A. Erera, M. Savelsbergh, and X. Wang. Optimization for dynamic ride-
sharing: A review. European Journal of Operational Research, 223(2):295–303,
2012.
[3] N. Armenatzoglou, S. Papadopoulos, and D. Papadias. A general framework for
geo-social query processing. Proc. Int’l Conf. Very Large Data Bases (PVLDB ’13),
6(10):913–924, 2013.
[4] A. Attanasio, J. F. Cordeau, G. Ghiani, and G. Laporte. Parallel tabu search heuristic-
s for the dynamic multi-vehicle dial-a-ride problem. Parallel Computing, 30(3):377–
387, 2004.
[5] E. Badger. Slugging–the people transit miller-mccune. 2011.
[6] B. Balasundaram, S. Butenko, and I. V. Hicks. Clique relaxations in social net-
work analysis: The maximum k-plex problem. Operations Research, 59(1):133–
142, 2011.
[7] R. Baldacci, V. Maniezzo, and A. Mingozzi. An exact method for the car pooling
problem based on lagrangean column generation. Journal of Operations Research,
52(3):422–439, 2004.
[8] V. Batagelj and M. Zaversnik. An o(m) algorithm for cores decomposition of net-
works. CoRR, 2003.
101
[9] R. W. Calvo, F. de Luigi, P. Haastrup, and V. Maniezzo. A distributed geographic in-
formation system for the daily carpooling problem. Computer Operation Research,
31(13):2263–2278, 2004.
[10] X. Cao, G. Cong, C. S. Jensen, and B. C. Ooi. Collective spatial keyword querying.
In Proc. ACM Int’l Conf. Management of Data (SIGMOD ’11), pages 373–384,
2011.
[11] L. Chen, G. Cong, C. S. Jensen, and D. Wu. Spatial keyword query processing: An
experimental evaluation. In Proc. Int’l Conf. Very Large Data Bases (PVLDB ’13),
pages 217–228, 2013.
[12] S.-J. Chen and L. Lin. Modeling team member characteristics for the formation of a
multifunctional team in concurrent engineering. IEEE Transactions on Engineering
Management, 51(2):111–124, 2004.
[13] T. Chen, M. A. Kaafar, and R. Boreli. The where and when of finding new friends:
Analysis of a location-based social discovery network. In Proc. Int’l Conf. Web and
Social Media (ICWSM ’13), pages 329–336, 2013.
[14] J. Cheng, Y. Ke, S. Chu, and C. Cheng. Efficient processing of distance queries in
large graphs: A vertex cover approach. In Proc. ACM SIGMOD Int’l Management
of Data (SIGMOD ’12), pages 457–468, 2012.
[15] J. Cheng, Y. Ke, S. Chu, and M. T. Ozsu. Efficient core decomposition in massive
networks. In Proc. IEEE Int’l Conf. Data Engineering (ICDE ’11), pages 51–62,
2011.
[16] B. Cici, A. Markopoulou, E. Frias-Martinez, and N. Laoutaris. Assessing the po-
tential of ride-sharing using mobile and social data: A tale of four cities. In Proc.
ACM Int’l Conf. Pervasive and Ubiquitous Computing (UbiComp ’14), pages 34–43,
2014.
[17] S. L. N. D. Collection. Online available at: http://snap.stanford.edu/.
102
[18] G. Cong, C. S. Jensen, and D. Wu. Efficient retrieval of the top-k most relevant
spatial web objects. In Proc. Int’l Conf. Very Large Data Bases (PVLDB ’09), pages
337–348, 2009.
[19] G. Cong, H. Lu, B. C. Ooi, D. Zhang, and M. Zhang. Efficient spatial keyword
search in trajectory databases. CoRR, 2012.
[20] J. Cordeau. A branch-and-cut algorithm for the dial-a-ride problem. Journal of
Operations Research, 54(1):573–586, 2003.
[21] J. F. Cordeau and G. Laporte. The dial-a-ride problem: Models and algorithms.
Annals of Operations Research, 153(1):29–46, 2007.
[22] A. Corral, Y. Manolopoulos, Y. Theodoridis, and M. Vassilakopoulos. Closest pair
queries in spatial databases. SIGMOD Record, 29(2):189–200, 2000.
[23] I. De Felipe, V. Hristidis, and N. Rishe. Keyword search on spatial databases. In
Proc. IEEE Int’l Conf. Data Engineering (ICDE ’08), ICDE ’08, pages 656–665,
2008.
[24] P. M. d’Orey, R. Fernandes, and M. Ferreira. Empirical evaluation of a dynamic and
distributed taxi-sharing system. In Proc. IEEE Int’l Conf. Intelligent Transportation
Systems (CITS ’12), pages 140–146, 2012.
[25] Y. Doytsher, B. Galon, and Y. Kanza. Querying geo-social data by bridging spatial
networks and social networks. In Proc. ACM Int’l Workshop on Location Based
Social Networks (LBSN ’10), pages 39–46, 2010.
[26] J. Fan, G. Li, L. Zhou, S. Chen, and J. Hu. Seal: Spatio-textual similarity search.
CoRR, 2012.
[27] N. Garg, G. Konjevod, and R. Ravi. A polylogarithmic approximation algorithm for
the group steiner tree problem. In Proc. ACM-SIAM Int’l Symposium on Discrete
Algorithms (SODA ’98), pages 253–259, 1998.
103
[28] A. Guttman. R-trees: A dynamic index structure for spatial searching. In Proc. ACM
Int’l Conf. Management of Data (SIGMOD ’84), pages 47–57, 1984.
[29] F. Harary and I. C. Ross. A procedure for clique detection using the group matrix.
Sociometry, 1957.
[30] G. R. Hjaltason and H. Samet. Incremental distance join algorithms for spatial
databases. In Proc. ACM SIGMOD Int’l Management of Data (SIGMOD ’98), pages
237–248, 1998.
[31] G. R. Hjaltason and H. Samet. Distance browsing in spatial databases. ACM Trans-
actions on Database Systems, 24(2):265–318, 1999.
[32] Y. Huang, R. Jin, F. Bastani, and X. S. Wang. Large scale real-time ridesharing with
service guarantee on road networks. In Proc. Int’l Conf. Very Large Data Bases
(PVLDB ’14), pages 2017–2028, 2014.
[33] R. M. Karp. Reducibility among combinatorial problems. In Complexity of Com-
puter Computations, pages 85–103. 1972.
[34] N. Katayama and S. Satoh. The sr-tree: An index structure for high-dimensional
nearest neighbor queries. In Proc. ACM Int’l Conf. Management of Data (SIGMOD
’97), pages 369–380, 1997.
[35] M. Kolahdouzan and C. Shahabi. Voronoi-based k nearest neighbor search for spa-
tial network databases. In Proc. Int’l Conf. Very Large Data Bases (PVLDB ’04),
pages 840–851, 2004.
[36] A. H. Land and A. G. Doig. An automatic method for solving discrete programming
problems. 50 Years of Integer Programming 1958-2008, pages 105–132, 2010.
[37] T. Lappas, K. Liu, and E. Terzi. Finding a team of experts in social networks. In
Proc. ACM Int’l Conf. Knowledge Discovery and Data Mining (SIGKDD ’09), pages
467–476, 2009.
104
[38] C.-T. Li and M.-K. Shan. Team formation for generalized tasks in expertise social
networks. In Proc. IEEE Int’l Conf. Social Computing (ICSC ’10), pages 9–16,
2010.
[39] Y. Li, R. Chen, J. Xu, Q. Huang, H. Hu, and B. Choi. Geo-social group queries
with minimum acquaintance constraint. IEEE Transactions on Knowledge and Data
Engineering, accepted to appear.
[40] Y. Li, D. Wu, J. Xu, B. Choi, and W. Su. Spatial-aware interest group queries in
location-based social networks. Data and Knowledge Engineering, 92:20–38, 2014.
[41] W. Liu, W. Sun, C. Chen, Y. Huang, Y. Jing, and K. Chen. Circle of friend query in
geo-social networks. In Proc. Int’l Conf. Database Systems for Advanced Applica-
tions (DASFAA ’12), pages 126–137, 2012.
[42] C. Long, R. C.-W. Wong, K. Wang, and A. W.-C. Fu. Collective spatial keyword
queries: A distance owner-driven approach. In Proc. ACM Int’l Conf. Management
of Data (SIGMOD ’13), pages 689–700, 2013.
[43] J. Lu, Y. Lu, and G. Cong. Reverse spatial and textual k nearest neighbor search. In
Proc. ACM Int’l Conf. Management of Data (SIGMOD ’11), pages 349–360, 2011.
[44] S. Ma and O. Wolfson. Analysis and evaluation of the slugging form of ridesharing.
In Proc. ACM Int’l Conf. Advances in Geographic Information Systems (SIGSPA-
TIAL ’13), pages 64–73, 2013.
[45] S. Ma, Y. Zheng, and O. Wolfson. T-share: A large-scale dynamic taxi ridesharing
service. In Proc. IEEE Int’l Conf. Data Engineering (ICDE ’13), pages 410–421,
2013.
[46] B. Martins, M. J. Silva, and L. Andrade. Indexing and ranking in geo-ir systems. In
Proc. ACM SIGSPATIAL Int’l Workshop on Geographic Information Retrieval (GIR
’05), pages 31–34, 2005.
105
[47] B. Mcclosky and I. V. Hicks. Combinatorial algorithms for the maximum k-plex
problem. Journal of Combinatorial Optimization, 23(1):29–49, 2012.
[48] B.-U. Pagel, H.-W. Six, H. Toben, and P. Widmayer. Towards an analysis of range
query performance in spatial data structures. In Proc. ACM SIGACT-SIGMOD-
SIGART Symp. Principles of Database Systems (PODS ’93), pages 214–221, 1993.
[49] D. Papadias, Q. Shen, Y. Tao, and K. Mouratidis. Group nearest neighbor queries.
In Proc. IEEE Int’l Conf. Data Engineering (ICDE ’04), pages 301–312, 2004.
[50] M. Rigby, A. Kruger, and S. Winter. An opportunistic client user interface to support
centralized ride share planning. In Proc. ACM Int’l Conf. Advances in Geographic
Information Systems (SIGSPATIAL ’13), pages 34–43, 2013.
[51] J. B. Rocha-Junior, O. Gkorgkas, S. Jonassen, and K. Nørvag. Efficient processing of
top-k spatial keyword queries. In Proc. Int’l Conf. Advances in Spatial and Temporal
Databases (SSTD ’11), pages 205–222, 2011.
[52] N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In Proc. ACM
Int’l Conf. Management of Data (SIGMOD ’95), pages 71–79, 1995.
[53] A. E. Sarıyuce, B. Gedik, G. Jacques-Silva, K.-L. Wu, and U. V. Catalyurek. Stream-
ing algorithms for k-core decomposition. Proc. Int’l Conf. Very Large Data Bases
(PVLDB ’13), 6(6):433–444, 2013.
[54] S. Scellato, A. Noulas, R. Lambiotte, and C. Mascolo. Socio-spatial properties of
online location-based social networks. In Proc. Int’l Conf. Web and Social Media
(ICWSM ’15), pages 329–336, 2011.
[55] S. B. Seidman. Network structure and minimum degree. Social Networks, 5:269–
287, 1983.
[56] J. Shi, N. Mamoulis, D. Wu, and D. W. Cheung. Density-based place clustering in
geo-social networks. In Proc. ACM Int’l Conf. Management of Data (SIGMOD ’14),
pages 99–110, 2014.
106
[57] H. Shin, B. Moon, and S. Lee. Adaptive multi-stage distance join processing. SIG-
MOD Record, 29(2):343–354, 2000.
[58] M. Sozio and A. Gionis. The community-search problem and how to plan a success-
ful cocktail party. In Proc. ACM Int’l Conf. Knowledge Discovery and Data Mining
(SIGKDD ’10), pages 939–948, 2010.
[59] Y. Tao, X. Xiao, and R. Cheng. Range search on multidimensional uncertain data.
ACM Transactions on Database Systems, 32(3):15–54, 2007.
[60] N. Z. S. D. Tours. http://www.newzealandselfdrivetours.co.nz/,
2015.
[61] A. T. U. S.-D. R. Trips. http://www.autotoursusa.com/, 2015.
[62] K. Tsubouchi, K. Hiekata, and H. Yamato. Scheduling algorithm for on-demand bus
system. Information Technology: New Generations, 2009.
[63] D. Wu, G. Cong, and C. S. Jensen. A framework for efficient spatial web object
retrieval. Journal of Very Large Data Bases, 21(6):797–822, 2012.
[64] D. Wu, M. L. Yiu, C. S. Jensen, and G. Cong. Efficient continuously moving top-
k spatial keyword query processing. In Proc. IEEE Int’l Conf. Data Engineering
(ICDE ’11), pages 541–552, 2011.
[65] Z. Xiang, C. Chu, and H. Chen. A fast heuristic for solving a large-scale static
dial-a-ride problem under complex constraints. European Journal of Operational
Research, 174(2):1117–1139, 2006.
[66] S. Yan and C. Y. Chen. An optimization model and a solution algorithm for the
many-to-many car pooling problem. Annals of Operations Research, 191:37–71,
2011.
[67] D.-N. Yang, Y.-L. Chen, W.-C. Lee, and M.-S. Chen. On social-temporal group
query with acquaintance constraint. In Proc. Int’l Conf. Very Large Data Bases
(PVLDB ’11), pages 397–408, 2011.
107
[68] D.-N. Yang, C.-Y. Shen, W.-C. Lee, and M.-S. Chen. On socio-spatial group query
for location-based social networks. In Proc. ACM Int’l Conf. Knowledge Discovery
and Data Mining (SIGKDD ’12), pages 949–957, 2012.
[69] N. J. Yuan, Y. Zheng, L. Zhang, and X. Xie. T-finder: A recommender system
for finding passengers and vacant taxis. IEEE Transations on Knowledge and Data
Engineering, 25(10):2390–2403, 2013.
[70] D. Zhang, Y. M. Chee, A. Mondal, A. Tung, and M. Kitsuregawa. Keyword search
in spatial databases: Towards searching by document. In Proc. IEEE Int’l Conf.
Data Engineering (ICDE ’09), pages 688–699, 2009.
[71] Y. Zhang and S. Parthasarathy. Extracting analyzing and visualizing triangle k-core
motifs within networks. In Proc. IEEE Int’l Conf. Data Engineering (ICDE ’12),
pages 1049–1060, 2012.
[72] W. Zhao, Y. Qin, D. Yang, L. Zhang, and W. Zhu. Social group architecture based
distributed ride-sharing service in vanet. Journal of Distributed Sensor Networks,
2014:1–8, 2014.
[73] Y. Zhou, X. Xie, C. Wang, Y. Gong, and W.-Y. Ma. Hybrid index structures for
location-based web search. In Proc. ACM Int’l Conf. Information and Knowledge
Management (CIKM ’05), pages 155–162, 2005.
[74] Q. Zhu, H. Hu, J. Xu, and W.-C. Lee. Geo-social group queries with minimum
acquaintance constraint. CoRR, abs/1406.7367, 2014.
[75] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Computing
Surveys, 38(2):1–56, 2006.
[76] A. Zzkarian and A. Kusiak. Forming teams: an analytical approach. IIE Transac-
tions, 31(1):85–97, 1999.
108
Curriculum Vitae
Academic qualifications of the thesis author, Mr. Yafei LI:
• Received the degree of Bachelor of Engineering from Henan Normal University,
July 2006.
• Received the degree of Master of Engineering from Suzhou University, July 2009.
June 2015