Exploring Behavioral Data in Online Social Media with Focus on User Connectivity and Mobility by Hongwei Liang B.Sc., Harbin Institute of Technology, China, 2012 Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in the School of Computing Science Faculty of Applied Science © Hongwei Liang 2018 SIMON FRASER UNIVERSITY Spring 2018 Copyright in this work rests with the author. Please ensure that any reproduction or re-use is done in accordance with the relevant national copyright legislation.

Exploring Behavioral Data in Online Social Media with ...hongweil/files/Thesis_Hongwei_SFU.pdf · Exploring Behavioral Data in Online Social Media with Focus on User Connectivity


Exploring Behavioral Data in Online Social Media with Focus on User Connectivity and Mobility

by

Hongwei Liang

B.Sc., Harbin Institute of Technology, China, 2012

Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

in the

School of Computing Science

Faculty of Applied Science

© Hongwei Liang 2018

SIMON FRASER UNIVERSITY

Spring 2018

Copyright in this work rests with the author. Please ensure that any reproduction or re-use is done in accordance with the relevant national copyright legislation.


Approval

Name: Hongwei Liang

Degree: Doctor of Philosophy (Computing Science)

Title: Exploring Behavioral Data in Online Social Media with Focus on User Connectivity and Mobility

Examining Committee:

Chair: Martin Ester, Professor

Ke Wang, Senior Supervisor, Professor

Jian Pei, Supervisor, Professor

Jiangchuan Liu, Internal Examiner, Professor

Xiaofang Zhou, External Examiner, Professor, School of Information Technology and Electrical Engineering, The University of Queensland

Date Defended: 23 April 2018


Abstract

With the booming development of online social media in recent years, a massive volume and variety of behavioral data, such as social interaction data and users’ e-travel sharing data, are generated by users throughout the world every day. Exploring and analyzing such data helps to understand users’ preferences, unearth the knowledge the data contain, and identify new problems and business opportunities. In this thesis, we are specifically interested in user connectivity/interaction behaviors, e.g., friendship creation, and mobility behaviors, e.g., check-in sequences at Points-of-Interest (POIs), both of which involve rich semantic information on the nodes and edges of social networks. We study three practical problems in different applications.

We first analyze users’ social connectivity behaviors from a new angle and study the problem of mining non-homophily social ties, which aims at discovering interesting but unexpected group-level social ties that do not follow the homophily phenomenon. We propose a novel ranking metric to identify such social ties and develop an efficient mining algorithm designed specifically for the new metric.

In our second work, we explore users’ check-in sequences or travel routes and study the problem of personalized trip recommendation under real-world constraints, considering personalized ratings on POIs and multiple constraints such as the time budget, the time window of POI availability, and the uncertainty of traveling time between POIs. We develop two efficient optimal solutions, as well as two heuristic solutions that find “good trips” with significantly better runtime.

Finally, in consideration of the sparsity of users’ historical rating data and the way people’s preferences change over time, we further study an on-demand route search problem with a personalized diversity requirement on POIs, where users can specify their preferred features for the route and a personalized trade-off between quantity (the number of POIs) and variety (the coverage of the specified features). We propose to model users’ personalized route diversity requirement by submodular functions, which capture the diminishing marginal utility property. We design a generic and elegant optimal algorithm as well as heuristic algorithms. Comprehensive empirical evaluations on real-life data sets demonstrate the effectiveness and efficiency of our methods.

Keywords: User behavior analytics; Social tie mining; Trip route recommendation and search


Dedication

Dedicated to my parents, my sister, and my wife.


Acknowledgements

I wish to first express my sincere gratitude to my senior supervisor Dr. Ke Wang, who provided great guidance and support to me on both academic research and life during my Ph.D. studies. I would also like to thank my supervisor Dr. Jian Pei for his insightful feedback and suggestions on improving the quality of my thesis, as well as Dr. Martin Ester, Dr. Jiangchuan Liu and Dr. Xiaofang Zhou for spending their valuable time serving as the chair and examiners for my Ph.D. thesis defence.

I am also very grateful to all my collaborators, lab mates, colleagues and friends around me for their kind help throughout my study, career and life so far.

Finally, I would like to express my special thanks to my parents, my sister and my wife, for their continuous and unconditional support, encouragement and love.


Table of Contents

Approval
Abstract
Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures

1 Introduction
  1.1 Background
    1.1.1 Social Network
    1.1.2 User Behavior Taxonomy and Behavioral Analytics
    1.1.3 Topics on Social Connectivity and Interaction Behaviors
    1.1.4 Topics on Mobility Behaviors
  1.2 Proposed Research Problems and Contributions
    1.2.1 Mining Non-homophily Social Ties
    1.2.2 Personalized Trip Recommendation Meets Real-world Constraints
    1.2.3 Route Search with Personalized Diversity Requirement on POIs
  1.3 Thesis Organization

2 Related Work
  2.1 Social Tie Mining
    2.1.1 Graph Mining in Social Networks
    2.1.2 Information Network Analysis
    2.1.3 Association Rule Mining
  2.2 Route Recommendation and Search
    2.2.1 Stand-alone Location Recommendation/Search
    2.2.2 Sequential Location Recommendation/Planning
    2.2.3 Trajectory Retrieval and Patterns Mining
    2.2.4 Operation Research and Scheduling

3 Mining Non-homophily Social Ties
  3.1 Motivations and Contributions
  3.2 Problem Statements
    3.2.1 Group Relationships
    3.2.2 Non-homophily Preference
    3.2.3 Top-k GRs Problem
    3.2.4 Challenges
  3.3 Data Structure
  3.4 Mining Top-k GRs
    3.4.1 Pruning Strategies
    3.4.2 Subset-First Depth-First (SFDF) Enumeration
    3.4.3 Computing Non-homophily Preference
    3.4.4 Algorithm
  3.5 Experimental Evaluation
    3.5.1 Data Sets
    3.5.2 Interestingness Study for Pokec Data
    3.5.3 Interestingness Study for DBLP Data
    3.5.4 Efficiency of Algorithms
  3.6 Summary and Extensions

4 Personalized Trip Recommendation Meets Real-world Constraints
  4.1 Motivations and Contributions
  4.2 Problem Statements
    4.2.1 Personalized Trip Recommendation Problem
  4.3 Modeling Preferences and Constraints
    4.3.1 Estimating User Preferences
    4.3.2 Modeling Time budget and POI Availability Constraints
    4.3.3 Modeling Uncertain Traveling Time
    4.3.4 Modeling POI Categories
  4.4 Optimal Method: State Expansion
    4.4.1 Dominance of States
    4.4.2 Algorithm
  4.5 Optimal Method: Prefix Based Depth-first Search
    4.5.1 Prefix Based Depth-first Search
    4.5.2 Algorithm
  4.6 Heuristic Methods
    4.6.1 State Relaxing
    4.6.2 Heuristic Insertion
  4.7 Experimental Evaluation
    4.7.1 Experimental Setup
    4.7.2 Rating Accuracy of Individual POIs
    4.7.3 The Fixed Traveling Time Model Without Diversity Constraint
    4.7.4 The Uncertain Traveling Time Model
    4.7.5 Effect of Diversity Constraint
  4.8 Summary and Extensions

5 Route Search with Personalized Diversity Requirement on POIs
  5.1 Motivations and Contributions
  5.2 Problem Statements
    5.2.1 Top-k Route Problem
    5.2.2 Modeling Route Diversity Requirement
    5.2.3 Framework Overview
  5.3 Indexing
    5.3.1 Offline Index Building
    5.3.2 Online Sub-index Retrieval
  5.4 Optimal Routes Search
    5.4.1 Search Strategy
    5.4.2 Cost-based Pruning Strategy
    5.4.3 Gain-based Pruning Strategy
    5.4.4 Algorithm
    5.4.5 Complexity Analysis
  5.5 Heuristic Methods
  5.6 Experimental Evaluation
    5.6.1 Experimental Setup
    5.6.2 Performance Study
    5.6.3 Comparison with A*
  5.7 Summary and Extensions

6 Conclusion
  6.1 Summary of the Thesis
  6.2 Future Directions

Bibliography

Appendix A List of Publications


List of Tables

Table 3.1 Frequently used notations

Table 3.2 Comparison of top GRs ranked by nhp and conf for Pokec data set

Table 3.3 Comparison of top GRs ranked by nhp and conf for DBLP data set

Table 4.1 Frequently used notations

Table 4.2 RMSE and MAE. Lower values are better

Table 4.3 Paired t-Test (2-tail) of FCF and baselines

Table 4.4 Pruning power of the SE/PDFS algorithms in various time budget settings, the numbers are the percentage of the pruned states by the algorithms

Table 4.5 Comparison of the averaged happiness over the 100 testing users for the case of β = 1 and β > 1, gained from the optimal routes recommended by PDFS. The numbers in brackets indicate that among the 100 generated optimal results with the constraint β = 1 for the testing users how many are also optimal with the constraint β equals to 2, 3 or 4

Table 5.1 Nomenclature

Table 5.2 Dataset statistics


List of Figures

Figure 1.1 Social Network Examples

Figure 1.2 Response patterns in the Facebook dating app, Are You Interested. The numbers represent the percentage of people responding to a “yes” on the app, by gender and ethnicity.

Figure 1.3 Example of trip recommendation

Figure 3.1 A toy dating network with attributes on nodes

Figure 3.2 Data structure: LArray, EArray and RArray

Figure 3.3 Subset-First Depth-First enumeration, with dynamic ordering of homophily attributes (shown in the blue dashed box)

Figure 3.4 Runtime for mining GRs for Pokec data

Figure 4.1 Example of trip recommendation.

Figure 4.2 Prefix based depth-first compact state enumeration tree. The number indicates the order of enumeration.

Figure 4.3 The fixed traveling time model: (left) happiness of trip routes found (y-axis) vs time budget (x-axis); (right) average runtime (y-axis) vs time budget (x-axis).

Figure 4.4 Case study of recommended trips for LA, with the happiness of each trip in bracket.

Figure 4.5 The uncertain traveling time model: (left) happiness of trip routes found (y-axis) by SE-SR and PDFS vs time budget b (x-axis); (right) average runtime (y-axis) vs time budget b (x-axis).

Figure 5.1 A sample POI map. Each node v_i represents a POI with 3 features (Park, Museum, Restaurant). Each feature having a numeric rating in the range [0, 1], indicated by the vector aside the POI. Each edge has an associated cost of traveling the edge.

Figure 5.2 R_h(i)^{−α} vs. rank R_h(i) for varied α

Figure 5.3 System Architecture

Figure 5.4 Left part: FI and HI built from the POI map in Figure 5.1. Right Part: Given a query Q, retrieve POI candidates V_Q by retrieving the subindices FI_Q and HI_Q from FI and HI.

Figure 5.5 Experimental results for Singapore. Run time and search space (# of routes) are in logarithmic scale. The labels beside data points indicate the ratio of queries successfully responded by the algorithm under the parameter setting. No label if no query fail. Data point or bar is not drawn if more than half fail. AP can only respond queries with small b. GR and AP can only find top-1.

Figure 5.6 Experimental results for Austin

Figure 5.7 Two routes found from Singapore by PACER+2 for the query Q = (x, y, b = 9, w = (P : 0.4, M : 0.3, R : 0.3), θ = 2.5, α), where x and y are Hilton Singapore, and P, M and R represent Park, Museum, and Chinese Restaurant.

Figure 5.8 PACER+2 vs. A* (logarithmic scale).


Chapter 1

Introduction

A variety of online social media services sprang up with the rise of Web 2.0, and they have grown rapidly in recent years. Some of the most popular social media, such as Facebook, Twitter, YouTube, WeChat, Foursquare and Instagram, have a massive number of users, who can view, share and publish content, interact with other users, and purchase and sell products on these platforms. Such ubiquitous user behaviors and interactions create hundreds of petabytes of data each year. According to [72], Facebook had 2.13 billion monthly active users as of the fourth quarter of 2017; these users spend an average of 20 minutes per day on the site and consequently generate 4 new petabytes of data per day recording their various behaviors and activities. Tremendous knowledge is contained in this massive behavioral data, and exploring it helps to make sense of observations, identify new problems and discover business opportunities.

1.1 Background

1.1.1 Social Network

The behaviors related to users’ engagement with social media and the interactions among entities form the skeleton of social media: social networks. A social network can be conveniently represented by a graph G = (V, E), where each entity in the social network is represented by a node v ∈ V, the interaction or (social) tie between two entities is represented by a directed or undirected edge e ∈ E, and E ⊆ V × V. Normally, each node and/or edge in a social network has associated attributes, features, weights or behaviors that carry semantic information about the node and/or edge. Therefore, a social network is a kind of information network as defined in [88]. Figure 1.1a gives a toy social network with associated semantic information.
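The graph representation above can be sketched in a few lines of Python. This is only an illustrative sketch, not a data structure from the thesis: the class name `SocialNetwork`, its methods, and the example node and edge attributes (`sex`, `race`, `location`, `tie`, `messages`) are assumptions chosen to mirror the kind of profile and behavior information shown in Figure 1.1a.

```python
from dataclasses import dataclass, field


@dataclass
class SocialNetwork:
    """Attributed graph G = (V, E); hypothetical helper for illustration."""
    directed: bool = False
    nodes: dict = field(default_factory=dict)  # v -> attribute dict
    edges: dict = field(default_factory=dict)  # (u, v) -> attribute dict

    def add_node(self, v, **attrs):
        self.nodes[v] = attrs

    def add_edge(self, u, v, **attrs):
        # E is a subset of V x V, so both endpoints must already be nodes.
        if u not in self.nodes or v not in self.nodes:
            raise KeyError("both endpoints must be in V")
        # Undirected edges are stored once under a canonical (sorted) key,
        # which assumes node identifiers are mutually comparable.
        key = (u, v) if self.directed else tuple(sorted((u, v)))
        self.edges[key] = attrs

    def neighbors(self, v):
        """Nodes reachable from v over one edge."""
        out = set()
        for a, b in self.edges:
            if a == v:
                out.add(b)
            elif b == v and not self.directed:
                out.add(a)
        return out


# A two-node toy network with semantic information on nodes and on the tie.
g = SocialNetwork()
g.add_node("alice", sex="F", race="Asian", location="US")
g.add_node("bob", sex="M", race="Asian", location="CA")
g.add_edge("alice", "bob", tie="friendship", messages=5)
```

Storing attributes in plain dictionaries keeps the sketch independent of any graph library; a real system would more likely use an adjacency-list index so that neighbor lookups do not scan all edges.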

Social networks can be heterogeneous, containing multiple classes of nodes and/or edges. A representative heterogeneous social network is the Location-based Social Network (LBSN), as shown in Figure 1.1b (this figure is from [4]). An LBSN includes users, locations and other location-tagged user-generated content, and the interactions or relationships among the different types of entities form the user graph, the location graph and the user-location graph, respectively. An edge in the user graph represents, for example, the physical distance between two users or their historical co-visits; an edge in the location graph usually represents the physical distance or travel cost between two locations; and an edge in the user-location graph indicates the user’s visiting frequency to the location.

[Figure 1.1 panels: (a) A toy social network with associated rich semantic information (graph topology; profile attributes such as sex, race and location; behaviors such as joining communities, messaging and sharing content). (b) A sample location-based social network [4], comprising the user graph, the location graph and the user-location graph.]

Figure 1.1: Social Network Examples

Social networks have some typical structural properties that are commonly observed in real-world networks. For example, the degree distribution of nodes follows a power-law distribution [36], and the small-world phenomenon is commonly observed, i.e., the average path length between pairs of nodes is usually small, e.g., 4.25 in Orkut [119]. In addition, certain distinguishable patterns can be observed when individuals connect or interact in social networks, such as the homophily phenomenon [75], in which similar individuals are more likely to connect to each other, and the weak tie theory [37], which states that more novel information flows to individuals through weak ties (connections with a low frequency of interaction) than through strong ties. A comprehensive summary of such structural properties and patterns is presented in [119].
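The two structural measures mentioned above, the degree distribution and the average path length, can be computed directly on a small graph. The following is a minimal sketch under stated assumptions: the graph is undirected, unweighted and given as an adjacency dictionary; the function names and the star-shaped toy graph are illustrative and do not come from the thesis. Exact all-pairs breadth-first search is only practical for small graphs; on networks of the size discussed here, the average path length is usually estimated by sampling.

```python
from collections import Counter, deque


def degree_distribution(adj):
    """Map each degree d to the fraction of nodes that have degree d."""
    counts = Counter(len(nbrs) for nbrs in adj.values())
    n = len(adj)
    return {d: c / n for d, c in sorted(counts.items())}


def average_path_length(adj):
    """Mean shortest-path length over all ordered pairs of distinct,
    mutually reachable nodes, using one breadth-first search per node."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs if pairs else 0.0


# Toy star graph: hub 0 is connected to leaves 1..4.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
dist_by_degree = degree_distribution(star)   # {1: 0.8, 4: 0.2}
avg_len = average_path_length(star)          # 32 / 20 = 1.6
```

In the star graph one hub holds most of the edges and every pair of nodes is at most two hops apart, which is a tiny caricature of the skewed degrees and short paths described above.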

1.1.2 User Behavior Taxonomy and Behavioral Analytics

As mentioned, ubiquitous user behaviors are observed in various social networks. Based on our knowledge and with reference to [44], we categorize the major user behaviors in online social networks, according to the subjects involved, into the following four classes:

• Social connectivity and interaction behavior, which occurs when users and other entities in online social networks establish connections and interact with each other. Examples are friendship creation, following/unfollowing, joining communities, messaging, viewing/sharing/commenting on friends’ posted content, and solving human intelligence tasks in online crowdsourcing services.

• Mobility behavior, which refers to behaviors that specifically rely on real-life locations or spatial information in the social media environment. Typical examples are online check-ins on LBSNs, trajectory/travel route sharing, transport activities, and augmented reality gaming.

• Content publishing behavior, which occurs when users generate new content on social media. The behaviors involved include profiling, writing reviews, and uploading content.

• Platform traffic behavior, which covers the user activities recorded by social media site monitoring, including creating accounts, migrating across sites, web browsing, click streams, searching and purchasing decisions, and malicious behaviors.

Exploring and analyzing behavioral data is essential and beneficial to multiple stakeholders. For social media users, customized services can be provided to help them make better use of the services and to enhance their experience. For service providers, offering satisfying services attracts more users, which brings profit to the business. For governments, the knowledge obtained from citizens’ behavioral data helps to understand public opinion and improve society. However, this task is not trivial, because behavioral data is massive, fast-changing, high-dimensional, unstructured and noisy. Owing to its significance and challenges, user behavioral analytics is attracting more and more attention from both the academic and industrial communities, and has opened up an array of research problems. Research works have different focuses according to the type of user behavior (as listed above) that they analyze.

We are specifically interested in the first two types of user behavior, i.e., social connectivity/interaction behavior and mobility behavior. Whether in real life or in social networks, social connectivity and interaction are the essence of social contact and the intrinsic reason why our world, or a social network, becomes more and more densely connected. Exploring the connectivity and interaction behaviors in online social media helps to reveal the way people connect, interact and exchange information with each other, to understand users’ preferences more deeply, and to identify business opportunities. Location data and mobility behaviors in social media, in turn, bridge the gap between the physical and digital worlds. People’s ever-growing dependence on mobile devices in recent years has triggered a great many location-based services and recreations, Online-To-Offline (O2O) businesses, and other emerging applications, which makes the analysis of mobility behaviors significant. Although the two are different types of behavior because they involve different subjects, they are in fact closely related and share many characteristics. First, both are essentially interaction behaviors: while the first type depicts the interaction or relationship between two users, the second depicts the movement interaction between a user and locations. Second, both are associated with a graph/network topology; an LBSN is also a kind of social network. In addition, both types involve rich semantic information on different classes of nodes and edges of the social networks and have wide applications in real life.

Though the majority of studies on social networks in recent years focus on these two types of behavior, many open and challenging problems keep emerging thanks to the rapid development and evolution of social media applications. Next, we briefly review the main topics related to exploring these two types of behavior in Sections 1.1.3 and 1.1.4, and then introduce the problems we study in this thesis in Section 1.2.

1.1.3 Topics on Social Connectivity and Interaction Behaviors

Among the wide range of research studying social connectivity and interaction behaviors, many works focus solely on the topological network structure, for example, summarizing a large graph by densely connected sub-structure patterns [110, 77, 111]. Here we briefly summarize several typical research topics that, in addition to the topological network structure, also consider the contextual information of the network or the associated behaviors.

Community Analysis

A community in online social media comes into existence when similar or like-minded users establish connections and interact with each other [119]. Communities can be either explicit, i.e., the groups, associations, clubs, etc. that are explicitly created by users on online social media sites, or implicit, i.e., sets of users who have common interests or characteristics but are not explicitly grouped, for example, the individuals who share the same taste for certain videos on YouTube.

A main task of community analysis is community detection: finding a set of communities or densely connected clusters in a social network. Community detection normally targets implicit communities, whose membership is not apparent to most people. Identifying communities helps to learn user interests and to provide customized recommendations or services to users. Communities also present a global view of user interaction: some behaviors are only observable at the group level, whereas a local view of individual behavior is often noisy and ad hoc [119]. A general review of community detection in social media can be found in the survey paper [31].

Information Diffusion

Information diffusion is defined as the process by which a piece of information or knowledge spreads and reaches individuals through interactions [119]. In social networks, the information can be a product promotion message, a piece of breaking news, etc.

A user of a social network decides whether to further spread a piece of information either independently or depending on the information she receives from others. Studies of information diffusion in recent years mainly focus on information cascades [45], which assume that a user’s decisions depend only on her immediate neighbors. This assumption is natural; e.g., users commonly share content posted by their friends in the network.

One important thread of research about information cascades is maximizing the spread of cas-

cades/influence. It is motivated by the viral marketing application: by initially targeting a few “in-

fluential” members of the network, we can trigger a cascade of influence by which friends will

recommend the product to other friends, and many individuals will ultimately try it. The problem

is then to choose an initial influential set with a budget of k nodes, such that the number of nodes

that are ultimately influenced in the network is maximized [45]. Many extensions or variations of

this problem and follow-up works can be found in recent literature, such as [55, 105, 19, 113].
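
The seed-selection problem above can be illustrated with a minimal sketch. It assumes the independent cascade model with a single uniform activation probability and a plain greedy hill-climbing strategy with Monte Carlo spread estimates; these choices, as well as the names `simulate_cascade`, `expected_spread` and `greedy_seed_selection`, are illustrative assumptions and are not taken from [45] or from this thesis.

```python
import random


def simulate_cascade(graph, seeds, prob, rng):
    """Run one independent-cascade simulation and return the activated nodes.

    graph: dict mapping a node to an iterable of its out-neighbors (assumed format).
    prob:  assumed uniform probability that an active node activates a neighbor.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph.get(u, ()):
                # Each newly active node gets one chance to activate each neighbor.
                if v not in active and rng.random() < prob:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


def expected_spread(graph, seeds, prob, runs, rng):
    """Monte Carlo estimate of the expected number of influenced nodes."""
    total = sum(len(simulate_cascade(graph, seeds, prob, rng)) for _ in range(runs))
    return total / runs


def greedy_seed_selection(graph, k, prob=0.1, runs=200, seed=0):
    """Greedily add the node with the largest estimated spread, k times."""
    rng = random.Random(seed)
    nodes = set(graph) | {v for nbrs in graph.values() for v in nbrs}
    chosen = []
    for _ in range(min(k, len(nodes))):
        candidates = sorted(nodes - set(chosen))
        best = max(
            candidates,
            key=lambda n: expected_spread(graph, chosen + [n], prob, runs, rng),
        )
        chosen.append(best)
    return chosen


toy_graph = {"a": ["b", "c"], "b": ["d"], "e": []}
seeds = greedy_seed_selection(toy_graph, k=2, prob=1.0, runs=1)
```

With `prob=1.0` the cascade is deterministic, so the toy call is reproducible; for realistic probabilities the estimate is noisy and the number of runs trades accuracy for time.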

Recommendations in Social Media

In social media, individuals usually face many options to choose from or a variety of decisions

to make, related to buying products, consuming services, making friends, etc. This motivates

the desire for customized recommender systems that suggest products/services/friends tailored to

individuals’ tastes and help them with decision making.

Typically, the input of a recommender system is the relationship between a set of users U and a

set of items I. An item can be generalized to a person or other material. The relationship indicating

the preference for every user-item pair is usually denoted by a sparse utility matrix M (with most of

the entries missing or unknown). The preference values in M can be either explicit, such

as numeric ratings in a fixed range or reviews with explicit user opinions, or implicit, such as page

views and click-streams on social media sites. The goal is to learn a function f that assigns a

real-valued preference to each user-item pair (u, i).

Although various methodologies are used in recommender systems, they can be classified into

two general classes: content-based methods and collaborative filtering (CF) [57]. While content-based

methods match the users’ interests/profiles with the descriptions/profiles of

items by certain similarity measures, CF methods are based on the idea that users might prefer

the items that were favored/bought by similar users in the past (collective intelligence). CF methods,

especially model-based CF, are more popular and have been shown to be generally more accurate

and robust in practical usage than content-based methods [107, 119].
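
To make the CF idea concrete, the following is a minimal sketch of the simplest memory-based, user-based variant (not the model-based CF mentioned above). The sparse utility matrix M is assumed to be stored as a dictionary of dictionaries, and f(u, i) is approximated by a similarity-weighted average of the ratings that other users gave to item i; the function names and the toy data are hypothetical.

```python
from math import sqrt


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity of two users over the items both have rated."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in common)
    norm_a = sqrt(sum(ratings_a[i] ** 2 for i in common))
    norm_b = sqrt(sum(ratings_b[i] ** 2 for i in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict(M, user, item):
    """Estimate f(user, item) from the ratings of similar users.

    M is the sparse utility matrix as {user: {item: rating}}; missing entries
    are simply absent. Returns None when no other user has rated the item.
    """
    numerator = 0.0
    denominator = 0.0
    for other, ratings in M.items():
        if other == user or item not in ratings:
            continue
        sim = cosine_similarity(M[user], ratings)
        numerator += sim * ratings[item]
        denominator += abs(sim)
    return numerator / denominator if denominator else None


# Toy utility matrix (made-up values for illustration only).
M = {
    "u1": {"i1": 5, "i2": 3},
    "u2": {"i1": 5, "i2": 3, "i3": 4},
    "u3": {"i1": 1, "i3": 2},
}
estimate = predict(M, "u1", "i3")
```

Note that cosine similarity over a single co-rated item is always 1, which is one reason practical systems add significance weighting or use model-based methods instead.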

Social Tie Analysis

Social ties are links that connect social actors, and are seen as “channels for transfer or flow of

resources (either material or nonmaterial)” [102]. Naturally occurring ties among social actors are

inherently complex and consist of numerous types with different interaction activities involved.

For example, a pair of social actors may have friendship, cooperation, or citation ties. Social ties

have been widely studied in social science [37, 38], and have received growing attention from the

computing science community in recent years.

The works on social tie analysis can be categorized into individual social tie analysis and collective

social tie analysis, based on whether they study individual behavior exhibited by a

single user (or a pair of users) or collective behavior observed when a group of users behave together [119].

A typical task of individual social tie analysis is link prediction, which aims to predict new/deleted

links between nodes in the near future, or missing/unobserved links in the current network [93, 70].

Collective social tie analysis focuses more on widely observed group-level interaction patterns

or aggregated results for large groups. For example, [95] studied the social structure of Facebook

networks and calculated the propensity for two nodes with the same categorical value to form a tie.

A detailed review of the works in this topic will be presented in Section 2.1.1.
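
As a small illustration of the link prediction task mentioned above, the sketch below scores every currently unlinked pair of nodes by the Jaccard coefficient of their neighbor sets, a standard neighborhood-based baseline. It is offered only as an assumed example of the task and is not the method of [93, 70]; the adjacency format and function name are hypothetical.

```python
from itertools import combinations


def jaccard_link_scores(adj):
    """Rank unlinked node pairs by the Jaccard coefficient of their neighborhoods.

    adj: dict mapping each node to the set of its neighbors (undirected graph).
    Returns a list of ((u, v), score) sorted by descending score, then by pair.
    """
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v in adj[u]:
            continue  # already linked, nothing to predict
        union = adj[u] | adj[v]
        if union:
            scores[(u, v)] = len(adj[u] & adj[v]) / len(union)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


toy_network = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
ranked = jaccard_link_scores(toy_network)
```

Pairs at the top of `ranked` share a large fraction of their neighbors and are the most likely candidates for a missing or future link under this heuristic.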

1.1.4 Topics on Mobility Behaviors

The boom of LBSNs opens new possibilities for location-based search, prediction, recommendation,

pattern mining, etc. We can categorize the topics of mining mobility behaviors into three

main categories: (1) location oriented; (2) user oriented; (3) location-embedded content or activity

oriented. While the location oriented topics treat locations as the main object of study and

essentially bring in a stream of new research problems, the majority of the topics in the other two categories are

similar to those in traditional social media, e.g., popular community discovery and friend

recommendation as summarized in Section 1.1.3, though the contextual information related to location brings

in new challenges [24]. Therefore, we only summarize some typical location oriented topics.

Stand-alone Locations Prediction or Recommendation

This line of research mainly treats stand-alone locations as the object of study. These works either predict

the locations a user may visit next [79, 32], or suggest to her the most preferred locations by

considering her personal preferences, spatial information, time factors, social relations, etc. [20,

52, 66, 126]. A few works focus on recommending travel packages consisting of a collection of

POIs, instead of independent POIs [33, 68].

Sequential Locations Pattern Mining/Recommendation/Planning

While the above topic predicts or suggests isolated POIs or a collection of POIs, another emerging topic

in recent years - sequential location recommendation/planning/pattern mining - focuses on POI

sequences/routes. A stream of works in this topic emphasizes discovering or retrieving frequent

sequence patterns from trajectory databases, geo-tagged semantic spatial objects or data collected

by location acquisition sensors, which is useful for understanding community/group preferences

on movement, detecting the occurrence of events, etc. [133, 116, 121, 131].

Another line of works recommends or plans a sequence of locations (POIs), e.g., A → B →

C → D, that matches a user’s personal interests. Different from recommending a sequence of items

to buy/view in traditional recommender systems, sequential location recommendation/planning has

to consider geographical closeness, reachability and other spatial constraints, and it has many

applications in real life, such as trip route recommendation, intelligent navigation, ride sharing,

intra-city delivery and augmented reality gaming. There are different ways to model and solve

the problem, for example, sequentially predicting or recommending the next location to finally plan a

route [53, 6], or globally finding routes that maximize (optimize) a certain measure of user satisfaction [26, 34, 12].

A detailed review of the works in this topic will be presented in Section 2.2.2.

1.2 Proposed Research Problems and Contributions

Despite the intensive study of social connectivity/interaction behaviors and mobility behaviors

on different topics, as summarized in Sections 1.1.3 and 1.1.4, many new research problems

in these areas keep emerging with the rapid development of social media. In this thesis, we mainly

study three research problems that have not yet been well studied. The first problem, Mining

Non-homophily Social Ties, falls into the social tie analysis topic of social connectivity/interaction

behaviors analytics, while the other two problems, Personalized Trip Recommendation Meets

Real-world Constraints, and Route Search with Personalized Diversity Requirement on POIs,

fall into the sequential location recommendation/planning topic of mobility behaviors analytics.

Essentially, all three problems study the interaction behaviors of users. While the first

problem studies the group-level interactions among users, the second and third problems focus on

moving interactions between a user and locations. They all explore network data with rich

attribute/feature information on nodes and edges. Besides, all three problems come from

real-life applications and have tremendous social and economic value. We introduce each

problem in turn below.

1.2.1 Mining Non-homophily Social Ties

[Figure 1.2 (image not reproduced): two panels, “Females responding to males” and “Males responding to females”, each with the ethnicity labels Asian, Black, Latino and White on both the female and male sides; the highest response rates are highlighted, and response-rate labels such as 9.3% and 6.7% appear in the figure.]

Figure 1.2: Response patterns in the Facebook dating app, Are You Interested¹. The numbers represent the percentage of people responding “yes” on the app, by gender and ethnicity.

Popular social networks usually possess a massive number of users and support a great many

applications. For example, Facebook had 2.13 billion monthly active users as of the fourth quarter of

2017 [72]. Large quantities of user demographic data and relationship data are associated with these

users. A study on the Facebook dating app, Are You Interested (AYI), has found some surprising

results by analysing users’ demographic data and the response rates between people, as shown in

Figure 1.2. The results are explained as “all except black women preferred white men, while all men

except Asians preferred Asian women” (we introduce this example only to motivate the problem;

it is not meant to involve any racial prejudice). Such frequent patterns of connections in social

networks, expressed concisely in terms of attribute information of nodes and edges, indicate specific common

social interactions. We call this kind of pattern “group social ties”.

It would be exciting to find all the frequent and surprisingly interesting group social ties as

above. But in reality, most frequent social ties are homophily social ties, i.e., they follow from

the homophily principle that similar individuals, with common characteristics such as race, age,

education, are more likely to connect with each other. Such homophilic social ties are usually

well-expected and people can easily work them out without much effort, even though they are somewhat

useful. Therefore, discovering the popular non-homophily social ties that are not expected from

homophily is more interesting and brings extra value to businesses.

¹ http://huffingtonpost.com/jenny-davis/race-online-dating_b_4449946.html. Figure 1.2 is reproduced based on the original figure in the study.

For instance, it is nearly common sense that dating between female graduates and male

graduates, i.e., people with the same education level, occurs with high probability. Based on this, we are more interested

in “if female graduates do NOT date male partners who are also graduates, which education level is

the second most popular target and how popular is this social tie?” We may find a surprising result:

if we exclude the “homophily effect” by restricting to the male partners NOT having graduate

education, the males having college education are the most popular target, which indicates a strong

preference beyond the homophily principle.
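
The intuition in this example can be sketched as a conditional frequency: among the ties of a source group, first discard the homophilous ones and then look at how the remaining ties are distributed over target values. The snippet below is only an illustration of that intuition on made-up toy counts; it is not the nhp metric, which is defined formally in Chapter 3, and the function name and data layout are hypothetical.

```python
from collections import Counter


def non_homophily_distribution(ties, source_value):
    """Distribution of target values among non-homophilous ties of one source group.

    ties: list of (source_attribute_value, target_attribute_value) pairs.
    Ties whose target has the same value as the source are excluded first,
    then the remaining ties are normalized into relative frequencies.
    """
    targets = Counter(
        target for source, target in ties
        if source == source_value and target != source_value
    )
    total = sum(targets.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in targets.items()}


# Made-up toy data: (education of female, education of male partner).
toy_ties = (
    [("graduate", "graduate")] * 6
    + [("graduate", "college")] * 3
    + [("graduate", "high school")] * 1
    + [("college", "college")] * 5
)
distribution = non_homophily_distribution(toy_ties, "graduate")
```

In the toy data the homophilous tie is the most frequent overall, but once it is excluded, "college" accounts for three quarters of the remaining ties of female graduates.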

To this end, we propose the problem of Mining Non-homophily Social Ties. We design a novel

ranking metric called non-homophily preference (nhp) to identify strong non-homophily group

social ties, and we formulate the problem as mining the k most interesting group social ties under the

nhp metric. We propose an effective and efficient approach for the problem and evaluate it on

real-world social network and citation network data. The details will be presented in Chapter 3.

1.2.2 Personalized Trip Recommendation Meets Real-world Constraints

Undoubtedly, traveling plays an important role in one’s daily life at the micro level and

makes a great economic contribution to a city or even a country at the macro level. According to

[96], the travel and tourism industry directly and indirectly contributed US$7.6 trillion to the global

economy and supported 292 million jobs in 2016. With the advancement of mobile devices and the

dramatic growth of publicly accessible location based data and services, an increasing number of

travelers are migrating from traditional tours organized by travel agencies to self-guided or DIY tours.

For instance, the popular travel planner Google Trips² automatically maps out a half-day or a full-day

plan with suggestions for things to see and do in a city. However, most such existing products only

suggest trips traversing famous places or user-selected POIs and ignore many realistic constraints.

This motivates us to develop an intelligent system that can suggest personalized trips to travelers

that fit their specific tastes and, moreover, meet some realistic temporal-spatial constraints.

Consider the trip recommendation scenario shown in Figure 1.3, where each capital letter

represents a POI with its personalized rating in brackets, and the icon below each letter indicates

the type of the POI. Though A and B have the highest scores individually, the trip in blue, source → B

→ A → destination, is not feasible, because POI B opens late and the traveling time of this trip

exceeds the user’s limited time budget. The green trip route, source → A → C → destination,

may then seem a good choice. However, the road between A and C is frequently congested, which makes the

completion of this trip within the given time budget very uncertain. In addition, the green route visits

two parks; if the user wants to visit at least two types of POIs, this route is not satisfactory anyway.

² https://get.google.com/trips/

[Figure 1.3 (image not reproduced): a map showing the source, the destination and four POIs with their personalized ratings: A (1.0), B (1.0), C (0.9) and D (0.8).]

Figure 1.3: Example of trip recommendation

Finally, we find that the trip route source → A → D → destination best meets the user’s requirements

and the constraints.

Motivated by the above scenario, we propose the problem of Personalized Trip Recommendation

Meets Real-world Constraints. The goal of the problem is to find an optimal trip that maximizes

user happiness, i.e., the accumulated total personalized ratings, under the constraints that the POIs in the

trip cover at least a certain number of different categories, that they can all be visited during their

opening hours, and that the trip can be completed within the user’s time budget with a probability not less

than a user-specified threshold. This problem is NP-hard and challenging. We propose both optimal

and heuristic algorithms for it and evaluate all the algorithms on real-life LBSN data sets. The full

details will be presented in Chapter 4.
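
The constraints listed above can be illustrated by a feasibility check for one candidate trip. The sketch below is written under strong simplifying assumptions that are not taken from this thesis: travel times of the legs are independent Gaussians given as (mean, standard deviation) in minutes, opening hours are checked against mean arrival times, and arriving before a POI opens makes the trip infeasible (waiting is not modeled). All field and function names are hypothetical; the actual model and algorithms are those of Chapter 4.

```python
from math import erf, sqrt


def completion_probability(legs, visit_minutes, budget):
    """P(total trip time <= budget) when leg times are independent Gaussians."""
    mean = sum(m for m, _ in legs) + sum(visit_minutes)
    variance = sum(s ** 2 for _, s in legs)
    if variance == 0:
        return 1.0 if mean <= budget else 0.0
    z = (budget - mean) / sqrt(variance)
    return 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF at z


def is_feasible(trip, legs, start, budget, min_categories, threshold):
    """Check opening hours, category coverage and the chance constraint.

    trip: list of POI dicts with keys open, close, visit, category, rating.
    legs: (mean, std) travel times for source->POI1, ..., POIn->destination.
    """
    clock = start
    for poi, (mean, _) in zip(trip, legs):
        clock += mean  # expected arrival time at this POI
        if clock < poi["open"] or clock + poi["visit"] > poi["close"]:
            return False
        clock += poi["visit"]
    if len({poi["category"] for poi in trip}) < min_categories:
        return False
    visits = [poi["visit"] for poi in trip]
    return completion_probability(legs, visits, budget) >= threshold


def total_rating(trip):
    """Objective value: accumulated personalized ratings of the visited POIs."""
    return sum(poi["rating"] for poi in trip)


# Toy POIs loosely following Figure 1.3 (times in minutes since midnight).
A = {"open": 540, "close": 1020, "visit": 60, "category": "park", "rating": 1.0}
C = {"open": 540, "close": 1020, "visit": 60, "category": "park", "rating": 0.9}
D = {"open": 600, "close": 1080, "visit": 60, "category": "museum", "rating": 0.8}
legs = [(30, 5), (30, 5), (30, 5)]
ok_trip = is_feasible([A, D], legs, start=540, budget=240,
                      min_categories=2, threshold=0.9)
```

A search procedure would enumerate or construct candidate trips, keep those for which `is_feasible` holds, and return the one with the largest `total_rating`; doing this efficiently is the hard part of the problem.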

1.2.3 Route Search with Personalized Diversity Requirement on POIs

In the above trip recommendation problem, a user’s overall ratings on POIs are learned based on

users’ historical ratings. However, in many cases users have very sparse historical rating data

or even no historical rating data (new users); besides, users’ preferences may change dynamically

over time. This motivates us to consider a more general on-demand application as below.

Instead of estimating a personalized rating for each POI, we assume that each POI is associated

with a vector of features (e.g., museum, park) with numeric or binary ratings, created from user

ratings and reviews on location-based services; and a user wants to be suggested a small number of

routes that not only satisfy her cost budget and spatial constraints, but also best meet her preferred

features and personalized route diversity requirements. In particular, she would like to explicitly

specify the exact features to be covered by the trip route, rather than the least number of features

as in the above problem. What is more, she may have a personalized trade-off between “quantity” (the number of

POIs with a specified feature) and “variety” (the coverage of specified features), i.e., when

visiting more POIs having a recurrent feature, e.g., visiting multiple parks in a single trip, the

satisfaction obtained by visiting each additional such POI is constant for some users, while for

other users it decreases gradually (diminishing marginal utility).

To deal with the above requirements, we propose the problem of Route Search with Personalized

Diversity Requirement on POIs, which is to find top-k routes having highest values of a certain gain

function for the set P of POIs on the routes, given a user’s source, destination and travel cost

budget. The gain function is a weighted sum of the utility obtained for each feature, and the utility

for each feature is an aggregation of the feature’s scores of the POIs in P. To model the personalized

quantity and variety trade-off for the POIs on a route, we propose to adopt submodular functions

that support the diminishing marginal utility property for aggregating the utility on each feature.

We propose elegant optimal and heuristic algorithms that work for generic submodular aggregation

functions. The experiments on two real-world datasets show that our methods greatly outperform

baselines. We will present the details in Chapter 5.
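
A gain function of the kind described above can be sketched as follows. The per-feature aggregation used here, a rank-discounted sum in which the r-th best score is divided by r raised to a user-chosen exponent alpha, is one assumed example of a submodular aggregation with diminishing marginal utility; the aggregation functions actually studied are given in Chapter 5. The names `feature_utility`, `gain`, `weights` and `alpha` are hypothetical.

```python
def feature_utility(scores, alpha):
    """Rank-discounted sum of one feature's scores over the POIs of a route.

    alpha = 0 gives a plain sum (constant marginal utility); alpha > 0 makes
    each additional POI with the same feature contribute less.
    """
    ranked = sorted(scores, reverse=True)
    return sum(score / (rank ** alpha) for rank, score in enumerate(ranked, start=1))


def gain(pois, weights, alpha):
    """Weighted sum over features of the aggregated utility of the POI set.

    pois:    list of dicts mapping feature name -> score for each POI.
    weights: user preference weight per requested feature.
    alpha:   per-feature discount exponent (missing features default to 0).
    """
    total = 0.0
    for feature, weight in weights.items():
        scores = [poi[feature] for poi in pois if poi.get(feature, 0) > 0]
        total += weight * feature_utility(scores, alpha.get(feature, 0.0))
    return total


route_pois = [{"park": 1.0}, {"park": 1.0}, {"museum": 0.5}]
weights = {"park": 1.0, "museum": 2.0}
gain_diminishing = gain(route_pois, weights, {"park": 1.0})
gain_constant = gain(route_pois, weights, {})
```

With the discount on "park", the second park adds only half of its score, so a user with this setting is nudged toward routes that trade a repeated park for a different requested feature.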

1.3 Thesis Organization

The remainder of the thesis is structured as follows.

• In Chapter 2, we mainly discuss the works related to the three proposed research problems.

• In Chapter 3, we present the work Mining Non-homophily Social Ties, including the mo-

tivation, the modeling of non-homophily preference and the formal problem statements, as

well as the complete algorithm framework and comprehensive experimental study on both

the effectiveness and efficiency using real world social network and citation network data.

• In Chapter 4, we study the problem of Personalized Trip Recommendation Meets Real-world

Constraints. In particular, we present how to learn users’ personalized ratings for un-rated

POIs and the methods of recommending the best trip route to a user considering her start/end

location, time budget, the POI availability at the time it is visited, the uncertainty of traveling

time between POIs, the user’s desired minimum number of different categories of POIs to

cover in a trip, etc., and we use real data to evaluate the methods.

• In Chapter 5, we describe the more general on-demand route search problem - Route Search

with Personalized Diversity Requirement on POIs - in detail. We motivate the need for using

submodularity to model the personalized route diversity requirement on POI features with

concrete examples and discussion, and propose an elegant algorithm framework that works for

any user-defined submodular gain function and addresses the high computational complexity

issue in a unified way.

• Finally, we conclude the thesis with a summary of our contributions and point out some

potential future directions in Chapter 6.

• A list of our publications about the three proposed works is included in Appendix A.

Chapter 2

Related Work

We summarized the typical topics of exploring social connectivity/interaction behaviors and

mobility behaviors in Sections 1.1.3 and 1.1.4 respectively from a bird’s eye view. In this chapter, we

zoom in to discuss the works related to the three problems proposed in Section 1.2.

Since the three problems fall into two different topics, we accordingly organize the related works

under two topics, i.e., social tie mining covers the related works for our first proposed problem,

and route recommendation and search covers the related works for our second and third proposed

problems.

2.1 Social Tie Mining

Our first proposed problem is to identify the strong group social ties in terms of node and edge

attributes from social networks. A group social tie is a kind of graph pattern in a social network;

thus, it is closely related to graph mining in social networks. In addition, according to [88],

networks with node and edge attribute information are defined as information networks; therefore,

our problem is also related to the study of information networks. The social ties in our problem have

a form similar to association rules in transaction data; thus, we will also discuss

the relationships between our work and several works in association rule mining. We organize the

related works into these three categories below.

2.1.1 Graph Mining in Social Networks

The skeleton of social media is the social network, which is in essence a graph. Social tie mining

is related to some of the works applying graph mining in social networks. The related works in this

area can be categorized at the graph level, the local structure level, and the link level.

At the graph level, many previous works on graph mining in social networks focus on simple

statistics of a large graph, such as degree distributions, hop-plots, clustering coefficients and the number

of triangles; see the surveys [15, 78]. Some works like [80] and [47] emphasize jointly modeling

the network structures and node attributes with probability models. These summaries or models

are useful for certain types of applications but do not aim at understanding how various groups of

actors interact with each other.
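
For reference, the graph-level statistics named above can be computed on a small undirected graph as in the following sketch. The adjacency-set representation and the function names are assumptions made for illustration; large-scale studies use far more efficient algorithms.

```python
from collections import Counter
from itertools import combinations


def degree_distribution(adj):
    """Map each degree value to the number of nodes having that degree."""
    return Counter(len(neighbors) for neighbors in adj.values())


def triangle_count(adj):
    """Count triangles; each one is found once per corner, hence the division."""
    corners = 0
    for u in adj:
        for v, w in combinations(adj[u], 2):
            if w in adj[v]:
                corners += 1
    return corners // 3


def local_clustering(adj, u):
    """Fraction of pairs of u's neighbors that are themselves connected."""
    k = len(adj[u])
    if k < 2:
        return 0.0
    links = sum(1 for v, w in combinations(adj[u], 2) if w in adj[v])
    return 2 * links / (k * (k - 1))


# Undirected toy graph as {node: set of neighbors}.
toy_adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
degrees = degree_distribution(toy_adj)
triangles = triangle_count(toy_adj)
```

Such numbers summarize the whole graph but, as noted above, say nothing about which attribute-defined groups of nodes are connected to which other groups.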

At the local structure level, a class of works summarizes a large graph by densely connected

subgraphs [54, 110], network motifs [77], and frequent sub-structures [51, 111]. A drawback of

structure-only patterns is the large number of patterns with no explanation of what kinds of nodes

participate in the patterns, and it is hard to use such patterns for applications with semantic

meaning. While the majority of these works exploit only the topological structure of graphs, some works

also consider the node attribute information for community detection. For example, [112] develops

a probabilistic model of the interaction between network structures and node attributes for

detecting overlapping communities. The motivation of community detection is quite different from

that of social tie mining.

At the link level, as we summarized under Social Tie Analysis in Section 1.1.3, the works can be

categorized into individual social tie analysis and collective social tie analysis based on whether they

study individual behavior exhibited by a single user (or a pair of users) or collective behavior observed

when a group of users behave together. Typical tasks of individual social tie analysis include link

formation, which studies the temporal evolution pattern of links [58]; link prediction, which aims to predict

future new links, deleted links or currently unobserved links between nodes in a network [93, 70];

and social tie strength analysis, which measures the strength of a tie by the frequency of interactions or

other metrics and leverages the results in other applications, e.g., recommendations [101]. Collective

social tie analysis focuses more on widely observed group-level interaction patterns or aggregated

results for large groups. Examples include [29], which studies attribute correlations, aims at discovering users’ social

strategies by studying the interrelation between node attributes, like age and gender, and applies the

patterns to infer users’ demographics, which serves a different purpose from our problem; and the

study on the social structure of Facebook networks in [95], which focuses on calculating the propensity

for two nodes with the same categorical value to form a tie.

Our problem falls into the category of collective social tie analysis at the link level and has some

similarity with [95]. While their work can be used to quantify and specify homophily attributes in

our problem, our focus is on searching for unexpected ties that do not follow from homophily, which

has never been done previously.

2.1.2 Information Network Analysis

This body of works studies information networks where each node and/or each edge belongs to

a specific object type, and nodes/edges may have descriptive attributes and weights, as in ours. Almost

all the tasks in networks without node and edge attributes have similar applications in

information networks, including clustering, ranking, classification, community detection, information

diffusion, recommendations, etc. However, most of the methods cannot be directly used to

solve the similar problems in information networks when the node/edge attributes are considered,

especially in heterogeneous information networks. [89, 117] summarize a set of topics for mining

information networks and present the details of solving the traditional problems in the context of

heterogeneous information networks.

There are also some research problems that are specific to information networks. For example,

some works abstract the information network at the meta level, i.e., as the types of nodes and links, and focus

on meta-path-based similarity search [90] and relationship prediction [87] in heterogeneous information

networks, where a meta-path is a structural path consisting of a sequence of relations defined

between different object types. They explore the network meta structures to better understand the

semantic meaning of the objects and relations, which is an angle similar to that of our group social ties in

terms of attribute information. But their tasks are totally different from ours.

Some other works aim to summarize entire information networks in an OLAP style,

including the Graph Cube [127] that allows the user to aggregate nodes and edges by rolling up or

drilling down attributes, and the k-SNAP operation [94] that generalizes a graph into k groups to

maximize the interestingness of pairwise relationships between the k groups. The follow-up work

[125] automatically categorizes numerical attribute values by exploiting the domain knowledge

hidden inside the node attribute values and the link structures, and proposes an interestingness

measure for graph summaries to point users to the potentially most insightful summaries. These graph

summarization works have some similarity to our work. A limitation of such graph-wise approaches

is that interesting relationships are hidden in the aggregated graph, which is a generalization of both

interesting and non-interesting relationships. Another stream of work, like [18], aggregates multiple

graphs into a summary static graph using OLAP methods to find patterns like “Top-10

central Authors” in a multi-dimensional view, e.g., time and venue. Our focus differs from the above,

i.e., we aim at identifying strong non-homophily group relationships that exist for certain groups of

nodes and certain types of edges.

The graph iceberg [61] identifies iceberg vertices for which some attribute value, such as prod-

uct purchase or network attack, in their vicinities is abnormally high, which is also different from

discovering strong relationships between generalized node groups as in our work.

2.1.3 Association Rule Mining

Association rule mining from transaction data has been extensively studied in early years [1, 40].

The support and confidence framework was first introduced in such works. Mining frequent

combinations of attribute-values in a relational table was studied as iceberg cube queries [8]. [103]

proposes “self-sufficiency” to measure the interestingness of itemsets. Multi-relational data mining [27]

generalizes frequent patterns by allowing multiple predicates and variables in a pattern. These works

deal with the relational data model where records are either isolated or linked to a fact table through

foreign key references, but they do not consider the issues associated with social networks. The

homophily of social networks requires reconsideration of interestingness metrics and new strategies

of pruning. [100] studies mining unexpected rules based on prior knowledge where unexpectedness

is measured by similarity between fuzzy terms. Such non-statistical rules cannot be used for social

network applications that motivate our work. Our non-homophily preference is a statistical measure

that excludes the homophily effect and captures the notion of conditional probability, which is well suited

for inference.
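
As a reminder of the support and confidence framework referred to above, the sketch below computes both quantities for a rule over a handful of made-up transactions. It follows the standard textbook definitions and is independent of the nhp measure of Chapter 3; the function names and the toy transactions are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    if not transactions:
        return 0.0
    itemset = set(itemset)
    hits = sum(1 for transaction in transactions if itemset <= set(transaction))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


# Made-up market-basket transactions.
toy_transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rule_support = support(toy_transactions, {"bread", "milk"})
rule_confidence = confidence(toy_transactions, {"bread"}, {"milk"})
```

Confidence is a conditional relative frequency, which is the same notion of conditional probability that the paragraph above attributes to the non-homophily preference.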

2.2 Route Recommendation and Search

Our second and third proposed problems aim at recommending or searching for the best (trip) routes,

i.e., sequences of POIs, that satisfy certain user requirements and temporal-spatial constraints.

Clearly, they are most related to the works in sequential location recommendation and planning. Before

constructing the best route, we need to first select the preferred locations; thus, stand-alone

location recommendation or search is also related to our works. Trajectory retrieval or pattern mining

retrieves existing paths and does not involve recommending or constructing a trip

route, but the output format is similar, and thus it is also related. At the theoretical level, the route

recommendation and search problems are variants of the Orienteering Problem or the Traveling Salesman

Problem in the field of operations research; hence, we will also include a discussion about the

relationship to this topic. We categorize the related works into the following specific topics.

2.2.1 Stand-alone Location Recommendation/Search

Many location-based recommendation works in recent years fall into this category of POI

recommendation, which scores each POI individually and recommends top-k POIs to a user. Some of the

works [20, 52, 66, 41] treat a location as an item with additional spatial features and adapt the

methods used for traditional recommender systems, or suggest the POIs to be visited next considering

the user’s current visiting location, time factors, social relations, etc. [126]. They either consider

no content information, e.g., the features on POIs, or consider content as side information when

making the recommendation; few of them give features the central role in collaborative filtering.

Some other research works go one step further and focus on recommending travel packages

consisting of a collection of POIs. For example, [108] solves a problem of top-k package

recommendation by modeling it as the knapsack problem, and [33, 68] develop probabilistic models to generate

possible packages by considering cost, season, area, etc. These works are still a subset of POI

recommendation. The key difference between (trip) route recommendation and POI recommendation

is that POI recommendation suggests stand-alone POIs instead of a sequence of POIs; thus, it

considers neither the order of visiting POIs nor constraints such as the time budget of users, the POI

availability and road conditions.

Another array of works fall into the field of keyword-aware location or spatial object search. [13]

and [14] retrieve top-k or a group of spatial web objects using carefully designed spatial-keyword

index structures. [5] identifies a location on any user-selected edge, such that there exists a set of

objects covering all query keywords and the total distance between this identified location and the

set of objects is minimized. [129] solves the problem of k-nearest neighbor search by keywords in

a continuous manner as the traveler moves. All these works aim at searching for stand-alone

location(s) that satisfy certain keyword requirements; the objective is not a sequence as in our problems.

2.2.2 Sequential Location Recommendation/Planning

While the above topic suggests or searches for isolated locations or a collection of locations, this

body of works suggests or plans a POI sequence or a location path, e.g., A → B → C → D with each

letter representing a POI, that best matches the user’s interests or requirements. Unlike

recommending a sequence of items to buy/view in traditional recommender systems, sequential

location recommendation/planning has to consider geographical closeness, reachability and

other spatial constraints. It has many real-life applications, such as trip route recommendation,

intelligent navigation, ride sharing, intra-city delivery and augmented reality gaming.

There are several branches of this topic. One branch of works recommends routes by directly

adapting existing routes or partial routes of other users. For example, [116] recommends itineraries

from user-generated digital trails; [104] constructs a route that sequentially passes the provided

locations within a time span by splicing multiple segments of retrieved uncertain trajectories; and [25]

recommends personalized driving routes from other drivers’ trajectories while considering the driver’s

travel cost preference.

Some other works fall into another stream that plans or recommends routes stepwise, deciding

the next location to visit at each step. For instance, [53, 21] learn probabilistic models,

i.e., Markov models, from users’ historical traveling behaviors and interests, and use the learnt

model to sequentially predict the next location based on the current location, finally yielding a

route; [3] requires user interaction to manually select a POI from each desired type; and [6]

interactively plans a route in steps and uses user feedback or selections to improve results.

Yet another trend of works that is closely related to our proposed second and third problems, e.g.,

[26, 34, 69, 12, 120], leverages location-based social media to gather information on POIs and

the edges between POIs, then globally finds POI sequences that maximize some measure of user

satisfaction while meeting subjective and objective constraints. These works may consider the popularity of

locations, keywords and features on locations/roads, traveling costs, time windows of locations, etc.

These problems can be generalized as the following constrained optimization problem:

$$P^{*} = \arg\max_{P} f(P), \quad \text{s.t. } C(P), \tag{2.1}$$

where P is a route consisting of a sequence of locations and the connecting edges in the road

network, P∗ is the best route found, f is the objective function to be optimized, and C(P)

specifies one or a set of constraints on route P. For example, one common constraint is

cost(P) ≤ B, i.e., the cost of a route should be no larger than some cost budget B. The problems

involved in these works are often NP-hard, as they generalize either the Travelling Salesman

Problem or the Orienteering Problem (introduced in Section 2.2.4). Dynamic

Programming, Branch-and-Bound, etc. are commonly used techniques for designing algorithms that

produce optimal results at relatively high computational cost. Heuristic algorithms with lower

computational cost yield approximate results. We compare with the following works in this category.

[34] assumes that each POI belongs to a single type or category and searches for a route with

POIs following a pre-determined order of types. As this work does not consider user-specific

preferences, it generates the same itinerary for all users. [109] supports only a single type of POI

and fixed traveling times between POIs. [26] leverages users’ historical photo streams to estimate

personalized ratings and stay times on POIs, and presents an approximate solution for constructing

travel routes. [69] adopts memory-based collaborative filtering to estimate user preferences

that change over time, but ignores features on POIs. [65] mainly focuses on modeling

the queuing time on POIs. None of these works considers the POI availability and uncertain time

constraints addressed in our proposed second problem. Besides, they either ignore the features

on POIs or assume a fixed order of POI types; none of them considers user-specific preferred

features and personalized diversity as in our proposed third problem.

Several works consider the keywords or features on POIs while optimizing a route. [62] treats

each POI covering the same keyword equally and maximizes the number of keywords covered by

a route given a distance threshold. [12] constructs an optimal route covering user-specified

categories of locations, assuming that each POI with a specified keyword fully meets the user’s need for this

keyword and optimizing some objective function on all edges in a route, such as travel distance or

popularity of edges. Such “all or nothing” feature modeling cannot address the general route diversity

requirement that models the user’s trade-off between quantity and variety, as considered in our proposed third

problem. [120] adopts a keyword coverage function to measure the degree to which query keywords

are covered by a route, similar to our proposed third problem. But their methods are designed for

their specific keyword coverage function; thus, this work does not address the personalized route

diversity requirement, where a different submodular function may be required.

2.2.3 Trajectory Retrieval and Patterns Mining

This body of works emphasizes retrieving desired existing trajectories or discovering

frequent sequence patterns from trajectory databases, geo-tagged semantic spatial objects or data

collected by sensors, which is useful for understanding community/group movement preferences,

event detection, etc. For example, [133] mines interesting locations and classical travel

sequence patterns from GPS trajectories; works like [2, 121] mine sequential patterns from semantic

trajectories or geo-tagged photos; [130] and [128] study similarity queries over semantic trajectories

to retrieve existing (segments of) trajectories that contain the most relevant keywords and yield the

least travel distance; and [131] uses indexing and pattern detection to discover gathering patterns, i.e.,

large congregations of individuals, from trajectory databases.

2.2.4 Operation Research and Scheduling

The classic Orienteering Problem (OP) [35, 98, 17], studied in operations research at the

theoretical level, finds a path, limited in length, that visits some nodes and maximizes a global

reward collected from the nodes. OP is a specialized instance of the Travelling Salesman Problem

[76] with a budget constraint. OP has a similar objective to that of the generalized constrained optimal

route recommendation/search problem in Eqn. (2.1) and is thus related to our proposed problems in

this topic. However, there are some important differences. First, OP does not consider personalized

user preferences or features, so only a global trip is planned. Second, OP has no touring time for

each location, which is an important factor affecting the number of POIs visited. Finally,

most studies on OP mainly concern the theoretical computational complexity of the problem or

its variants, and ignore constraints or requirements that matter in real-world (trip)

route recommendation and search applications, such as the uncertain traveling time between POIs,

the opening hours of POIs and the personalized diversity requirement on features of a route.

Compared to OP, the Arc Orienteering Problem (AOP) [71] associates the utility with edges instead of

nodes. Our proposed route recommendation/search problems generalize AOP since edge utility can

be modeled by inserting a dummy POI on each edge.
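
To make the orienteering formulation concrete, the following is a minimal brute-force sketch of Eqn. (2.1) with the single constraint cost(P) ≤ B and f(P) taken as the sum of node rewards. The function name `best_route`, the four-node graph, its rewards and its distances are all hypothetical and chosen only for illustration; exhaustive enumeration is exponential and is not one of the algorithms discussed in the cited works.

```python
from itertools import permutations


def best_route(rewards, dist, start, budget):
    """Exhaustively search simple paths from `start`; return (reward, path).

    rewards: dict node -> reward collected when the node is visited
    dist:    dict node -> dict node -> travel cost
    budget:  maximum total travel cost of the path (the bound B)
    """
    others = [v for v in rewards if v != start]
    best = (rewards[start], (start,))
    for r in range(1, len(others) + 1):
        for perm in permutations(others, r):
            path = (start,) + perm
            cost = sum(dist[a][b] for a, b in zip(path, path[1:]))
            if cost <= budget:  # constraint C(P): cost(P) <= B
                reward = sum(rewards[v] for v in path)  # objective f(P)
                if reward > best[0]:
                    best = (reward, path)
    return best


# Hypothetical toy instance (not taken from the thesis).
rewards = {"A": 0, "B": 5, "C": 4, "D": 6}
dist = {
    "A": {"B": 2, "C": 3, "D": 5},
    "B": {"A": 2, "C": 2, "D": 4},
    "C": {"A": 3, "B": 2, "D": 2},
    "D": {"A": 5, "B": 4, "C": 2},
}
print(best_route(rewards, dist, "A", budget=6))  # (15, ('A', 'B', 'C', 'D'))
```

With a tighter budget of 4 the same search returns the shorter route A → B → C, which shows how the budget bounds the number of POIs that can be visited.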

Some other works on real-life scheduling problems are related to our modeling of the uncertain

traveling time. [10] considers multiple types of transport within a single trip and adopts Monte

Carlo simulation to estimate the probability of catching the trip in non-deterministic transport

networks. [106] introduces a Bayesian model to estimate the distribution of ambulance traveling time.

Chapter 3

Mining Non-homophily Social Ties

3.1 Motivations and Contributions

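[Figure 3.1(a), image not reproduced: a dating network over nodes 1 to 14; legend entries are Grad, College, High School (Education), Asian, Latino, White (Race), and M, F (Sex).]

(a) Network topology (patterns represent Race and shapes represent Education)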

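| ID | SEX | RACE | EDU |
|----|-----|------|-----|
| 1 | F | Asian | Grad |
| 2 | F | Latino | Grad |
| 3 | F | White | Grad |
| 4 | F | Asian | College |
| 5 | F | White | College |
| 6 | F | Asian | High School |
| 7 | F | Latino | High School |
| 8 | M | Asian | Grad |
| 9 | M | Latino | Grad |
| 10 | M | White | Grad |
| 11 | M | Latino | College |
| 12 | M | White | College |
| 13 | M | Asian | High School |
| 14 | M | White | High School |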

(b) Attributes on nodes

Figure 3.1: A toy dating network with attributes on nodes

Social networks are heterogeneous and multidimensional [127] in that nodes and edges belong

to certain classes and each class is described by multiple attributes. For example, in addition

to the different types of relationships, each user in Facebook has a profile that reveals detailed

personal information. According to [102], social ties are links that connect social actors, and are seen

as “channels for transfer or flow of resources (either material or nonmaterial)”. Discovering

interesting group-level social ties, described concisely in terms of the attribute information of nodes and edges,

holds a key to understanding how actors interact with each other and form relationships,

which is useful in user behavior analysis and modeling, friend/item recommendation, inferring

user demographics, etc.

To illustrate, consider the toy online dating network in Figure 3.1. A dyadic tie is a dating

relationship and each individual has attributes SEX, RACE, and EDU. We represent a group of ties between

two groups of individuals by a group relationship or GR, denoted $l \xrightarrow{w} r$. Here l and r are the attribute

information describing the two groups of nodes and w is the attribute information describing the

edges between the groups. GRs serve as the representation of group social ties.

Example 1. A pattern similar to the finding in the AYI app study as mentioned in Section 1.2.1,

i.e., “all men except Asians preferred Asian women”, can be represented by GR1 and GR2 in the

following table.

| GR | Group relationship | supp | conf |
|----|--------------------|------|------|
| GR1 | (SEX:M) → (SEX:F, RACE:Asian) | 7/15 | 7/14 |
| GR2 | (SEX:M, RACE:Asian) → (SEX:F, RACE:Asian) | 0 | 0 |

In GR1, the edge descriptor w = dates is omitted. supp = 7/15 and conf = 7/14 are two

intuitive metrics, support and confidence, originally used for association rule mining [1]. supp = 7/15

means that 7 out of the 15 links are involved in this relationship, and conf = 7/14 means that

7 out of the 14 links originating from male nodes go to nodes for Asian women. GR2

and GR1 together suggest that while most men preferred Asian women, Asian men are an exception.

This finding could be interesting to a dating service provider.

Frequent and surprising GRs like GR1 are practically useful. However, in reality

most frequent GRs follow the homophily principle, or “love of the same” [75]: contact between

similar people occurs at a higher rate than among dissimilar people, where similarity is measured

by certain common characteristics such as beliefs/religion, values, race, age, etc. That is, GRs that

are expected from the homophily principle tend to have a high confidence, and such GRs

generally have the form $l \xrightarrow{w} r$ where the values in r occur in l. In this work, we assume that the

homophily principle is known, and our goal is to find the GRs that are popular and interesting, but

are not simply expected from homophily. Example 2 illustrates that such GRs can be potentially

useful but they are not ranked high by confidence.

Example 2. Consider the two GRs, GR3 and GR4, listed in the table below.

| GR | Group relationship | supp | conf |
|----|--------------------|------|------|
| GR3 | (SEX:F, EDU:Grad) → (SEX:M, EDU:Grad) | 4/15 | 4/6 |
| GR4 | (SEX:F, EDU:Grad) → (SEX:M, EDU:College) | 2/15 | 2/6 |

Assume that the attribute EDU follows the homophily principle. Therefore, GR3 likely has a high

confidence but is not interesting because it is expected from the homophily principle. GR4 likely has

a low confidence since GR3 has a high confidence. supp and conf are obtained from the data in

Figure 3.1. A closer inspection of the data reveals that if a female with Grad education does NOT

want her partner to have Grad education, i.e., if we exclude the “homophily effect” by restricting to

male partners not having Grad education, there is a high chance that she prefers a partner with

College education; the chance is supp(GR4)/(supp(SEX:F, EDU:Grad) − supp(GR3)) =

2/(6 − 4) = 100% (in absolute edge counts). This preference for College education, which is conditioned on

educations other than Grad, could be interesting to the dating service provider.

This observation motivates a new ranking metric, called non-homophily preference, which

will be formally defined in Section 3.2. Intuitively, non-homophily preference captures “secondary

bonds” beyond the “primary bonds” of homophily. One contributing factor of secondary bonds is

heterophily [82], i.e., the tendency of individuals to collect in diverse groups. It was shown that

heterophilous networks are better at promoting and spreading innovations [82]. Thus, though the

primary bond is important in multiple applications, exploring the secondary bond can result in more

interesting findings and bring extra value to many businesses. The next example further explains

this point.

Example 3. To leverage social influence for promoting products, an obvious strategy for a financial

institution is to use GRs following from homophily, such as

(JOB : Lawyer, PRODUCT : Stocks)→ (PRODUCT : Stocks)

to promote Stocks to the friends, f, of existing customers who are lawyers and have bought Stocks

(on LHS). This effort fails if most such friends f already bought or do not like Stocks. On the other

hand, suppose

(JOB : Lawyer, PRODUCT : Stocks)→ (PRODUCT : Bonds)

has a high non-homophily preference, that is, among the friends f who do not buy Stocks, many buy

Bonds. This GR can be used to promote Bonds to a friend if he/she has not bought Bonds, and the

high non-homophily preference implies a high adoption rate.

Indeed, many companies have both e-commerce services and social network services, enabling

them to create information networks to mine GRs for economic benefits. For example, Alibaba

Group¹ provides various sales services, and has the instant messenger Aliwangwang that builds the

social network among customers and vendors. As another example, Facebook Platform² allows a

third-party business to build applications based on its platform. This tool enables integrating the

social graph with the customer information owned by the third-party business, and applications on

facebook.com are allowed to access the graph.

A detailed review of related work is presented in Section 2.1. Among them, a body of works,

such as [127, 94], considers information networks and focuses on summarizing the entire graph,

whereas we focus on identifying strong relationships that exist for certain groups

of nodes and certain types of edges. Our problem has some similarity with [95], which focuses on

calculating the propensity for two nodes with the same categorical value to form a tie in Facebook

networks. While their work can be used to quantify and specify homophily attributes in our problem,

¹ http://en.wikipedia.org/wiki/Alibaba_Group

² http://en.wikipedia.org/wiki/Facebook_Platform

our focus is on searching for unexpected ties that do not follow from homophily. To the best of our

knowledge, non-homophily preference in social ties has not been studied previously.

Contributions

In summary, we make the following contributions.

• We propose a novel ranking metric called non-homophily preference (Section 3.2.2) to

identify strong group social ties beyond the homophily principle; we define the problem of

mining top-k GRs (Section 3.2.3) to extract the k most interesting group social ties under the

non-homophily preference metric.

• The search space of top-k GRs is large due to multidimensional nodes and edges and the lack

of usual anti-monotonicity of non-homophily preference. We first propose a compact data

structure to store the multidimensional nodes and edges information in social networks (Sec-

tion 3.3), then we present a novel search strategy to enable a new form of anti-monotonicity

for non-homophily preference (Section 3.4). This strategy ensures that only non-trivial GRs

that meet a minimum requirement on support and non-homophily preference are enumerated.

• We present an efficient top-k GRs mining algorithm, GRMiner, based on the new data struc-

ture and the above search strategy (Section 3.4.4).

• We evaluate our approach on two real-world social network and citation network data sets

(Section 3.5), and provide potential extensions of our framework (Section 3.6).

3.2 Problem Statements

A social network is represented by a graph G = (V, E) as in Section 1.1.1. Each node in V and each

edge in E is described over a fixed set of node/edge attributes. Each attribute A has a discrete

domain {0, 1, · · · , |A|}, where |A| is the domain size (a.k.a. cardinality), with 0 representing the

null value. We consider directed edges; an undirected edge can be represented by a pair of directed

edges in opposite directions. A subset of nodes of V that share the same values a on some node

attributes A can be represented by a set of pairs (A : a) called a node descriptor. For example,

(SEX:F, JOB:IT) represents all the nodes having the values (SEX:F, JOB:IT). Similarly, a subset

of edges in E can be represented by a set of pairs (A : a) called an edge descriptor. Table 3.1

summarizes the main notations used in the work.

3.2.1 Group Relationships

Definition 1. [GR] A group relationship (GR) has the form $l \xrightarrow{w} r$, where l and r are node

descriptors and w is an edge descriptor. l is called the LHS and r is called the RHS. L, W, R denote the attribute sets

for l, w, and r, respectively.

Table 3.1: Frequently used notations

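| Notation | Interpretation |
|----------|----------------|
| G(V, E) | graph with the nodes V and the edges E |
| l, r, w | three parts of attribute values in a GR $l \xrightarrow{w} r$ |
| L, R, W | attribute sets for l, r, and w, respectively |
| $A_l$, $A_r$ | $A_l$ is an attribute in L, and $A_r$ is the same attribute in R |
| $l \xrightarrow{w} l[\beta]$ | homophily effect, see Eqn. (3.5) |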

GRs serve as the representation of group social ties. For GR3 in Example 2, l = (SEX:F,

EDU:Grad), w = (TYPE:dates), and r = (SEX:M, EDU:Grad). This GR says that females with Grad

education tend to prefer male partners with Grad education. The “tendency” can be measured by

support and confidence [1].

Definition 2. [Support] Support of $l \xrightarrow{w} r$ indicates the probability that an edge satisfies all the

conditions in l ∧ w ∧ r:

$$\mathrm{supp}(l \xrightarrow{w} r) = P(l \wedge w \wedge r) = \frac{|E(l \wedge w \wedge r)|}{|E|}. \tag{3.1}$$

|E(l ∧ w ∧ r)| denotes the number of edges satisfying l ∧ w ∧ r. Support of l ∧ w is defined as

$$\mathrm{supp}(l \wedge w) = P(l \wedge w) = \frac{|E(l \wedge w)|}{|E|}. \tag{3.2}$$

With |E| being a constant for a given network, we can use absolute support by dropping the

denominator |E|. While support measures the generality of a GR, confidence measures the strength

of a GR.

Definition 3. [Confidence] Confidence of $l \xrightarrow{w} r$ is defined as

$$\mathrm{conf}(l \xrightarrow{w} r) = P(r \mid l \wedge w) = \frac{\mathrm{supp}(l \xrightarrow{w} r)}{\mathrm{supp}(l \wedge w)}. \tag{3.3}$$
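
Definitions 2 and 3 can be illustrated with a small sketch. The edge list below is hypothetical: it reproduces only the six edges leaving female Grad nodes implied by Example 2 (four to male Grad, two to male College) plus two made-up edges, so it is not the 15-edge network of Figure 3.1 and its support values differ from those in the example tables. The helper names `matches`, `count`, `supp` and `conf` are our own choices for illustration.

```python
from fractions import Fraction


def matches(attrs, descriptor):
    """True if a node/edge record agrees with every (attribute: value) pair."""
    return all(attrs.get(a) == v for a, v in descriptor.items())


def count(edges, l, w, r=None):
    """Absolute support: number of edges satisfying l ∧ w (and r, if given)."""
    return sum(
        1
        for src, edge, dst in edges
        if matches(src, l) and matches(edge, w) and (r is None or matches(dst, r))
    )


def supp(edges, l, w, r):
    """Eqn. (3.1): fraction of all edges satisfying l ∧ w ∧ r."""
    return Fraction(count(edges, l, w, r), len(edges))


def conf(edges, l, w, r):
    """Eqn. (3.3); raises ZeroDivisionError if no edge satisfies l ∧ w."""
    return Fraction(count(edges, l, w, r), count(edges, l, w))


# Hypothetical edge list: (source attributes, edge attributes, target attributes).
F_GRAD = {"SEX": "F", "EDU": "Grad"}
M_GRAD = {"SEX": "M", "EDU": "Grad"}
M_COLL = {"SEX": "M", "EDU": "College"}
F_COLL = {"SEX": "F", "EDU": "College"}
DATES = {"TYPE": "dates"}
edges = (
    [(F_GRAD, DATES, M_GRAD)] * 4
    + [(F_GRAD, DATES, M_COLL)] * 2
    + [(M_COLL, DATES, F_COLL)] * 2
)

# GR3: (SEX:F, EDU:Grad) -> (SEX:M, EDU:Grad)
print(supp(edges, F_GRAD, DATES, M_GRAD), conf(edges, F_GRAD, DATES, M_GRAD))  # 1/2 2/3
```

The confidence 2/3 equals the 4/6 reported for GR3, since it depends only on the six edges leaving female Grad nodes; the support differs because this toy list has 8 edges rather than 15.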

3.2.2 Non-homophily Preference

A simple approach to identifying interesting GRs is to specify a minimum threshold on

support and a minimum threshold on confidence. However, many GRs that have a high support and

a high confidence, like GR3, are well expected because of the homophily effect, while those that

do not follow from homophily but are still interesting, like GR4, are missed due to a low confidence

unless we set the thresholds for support and confidence very low, which leads to a much

larger search space. In this work, we are interested in GRs that are not expected from the homophily

principle. Therefore, the confidence metric is not suitable for our purpose and we need a new metric

to identify interesting GRs. First, let us clarify the notion of homophily.

Homophily attributes. Intuitively, a GR is considered to follow from the homophily principle if

the LHS and RHS of this GR share a (set of) value(s). However, homophily is attribute sensitive:

sharing values on certain attributes, such as the attribute SEX

on a dating site, is not considered trivial. We classify each attribute as either a homophily attribute or a non-homophily

attribute. A homophily attribute is an attribute on which individuals sharing the same

value are more likely to connect to each other. For a given social network, we assume that the

setting of homophily attributes is specified. Some existing works, like [95], study methods

to identify homophily attributes. In many cases, homophily attributes are known from common

sense. For example, EDU is likely a homophily attribute for dating relationships whereas SEX is a

non-homophily attribute since dating could be between two people of same or opposite sex.

To capture and rank the GRs not expected from homophily, we propose to exclude the

homophily effect from confidence. In general, let $A_l$ denote an attribute A in L and $A_r$ the same

attribute A in R (L and R are defined in Definition 1). For a GR $l \xrightarrow{w} r$, let β denote the homophily

attributes in R that occur in L but have different values on the two sides, i.e.,

$$\beta = \{A_r \in R \mid A \text{ is a homophily attribute},\ A_l \in L \text{ and } r[A_r] \neq l[A_l]\}. \tag{3.4}$$

Then l[β] denotes the part of RHS containing the values in l restricted to β. We define the homophily

effect as

$$l \xrightarrow{w} l[\beta]. \tag{3.5}$$

Consider GR4, (SEX:F, EDU:Grad) $\xrightarrow{\text{dates}}$ (SEX:M, EDU:College), in Example 2. Assume that EDU

is a homophily attribute while SEX is not. The values for EDU on the two sides are different; thus,

β = {EDU}, and the homophily effect $l \xrightarrow{w} l[\beta]$ is (SEX:F, EDU:Grad) $\xrightarrow{\text{dates}}$ (EDU:Grad). Recall that

$\mathrm{conf}(l \xrightarrow{w} r) = \mathrm{supp}(l \xrightarrow{w} r) / \mathrm{supp}(l \wedge w)$. We can exclude the homophily effect by subtracting $\mathrm{supp}(l \xrightarrow{w} l[\beta])$

from the denominator supp(l ∧ w) in the confidence. This gives rise to the following new metric.

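Definition 4. [Non-homophily Preference] The non-homophily preference of a GR

$l \xrightarrow{w} r$ is given by

$$\mathrm{nhp}(l \xrightarrow{w} r) = P(r \mid l \wedge w \wedge \neg l[\beta]) = \frac{\mathrm{supp}(l \xrightarrow{w} r)}{\mathrm{supp}(l \wedge w) - \mathrm{supp}(l \xrightarrow{w} l[\beta])}. \tag{3.6}$$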

Intuitively, $\mathrm{nhp}(l \xrightarrow{w} r)$ is the conditional probability of links going to a node described by r,

given that they satisfy l ∧ w and do not go to a node described by l[β]. For GR4, the confidence is

given by supp(GR4)/supp(l ∧ dates) = 2/6. Of the support supp(l ∧ dates) = 6, 4 is contributed

by the homophily effect. Excluding this effect from supp(l ∧ dates) = 6 gives nhp(GR4) = 2/(6 − 4) =

100%, read as: women with Grad education, when not dating men having Grad education,

were dating men having College education with 100% probability.
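
The computation of β (Eqn. (3.4)) and nhp (Eqn. (3.6)) for GR4 can be sketched as follows. The six edges are a hypothetical list consistent with the counts quoted above (six edges leave female Grad nodes, four of them to male Grad and two to male College); the function names `matches`, `beta` and `nhp` are our own, and this is an illustration of the definition rather than the mining algorithm of Section 3.4.

```python
from fractions import Fraction


def matches(attrs, descriptor):
    """True if a node/edge record agrees with every (attribute: value) pair."""
    return all(attrs.get(a) == v for a, v in descriptor.items())


def beta(l, r, homophily_attrs):
    """Eqn. (3.4): homophily attributes on both sides whose values differ."""
    return {a for a in r if a in homophily_attrs and a in l and r[a] != l[a]}


def nhp(edges, l, w, r, homophily_attrs):
    """Eqn. (3.6), computed with absolute supports.

    Raises ZeroDivisionError when the denominator is zero; by Theorem 1 this
    cannot happen if at least one edge satisfies l ∧ w ∧ r.
    """
    b = beta(l, r, homophily_attrs)
    l_beta = {a: l[a] for a in b}  # l[β]: the values of l restricted to β
    lw = [(s, e, d) for s, e, d in edges if matches(s, l) and matches(e, w)]
    hits = sum(1 for _, _, d in lw if matches(d, r))
    # Remark 1: when β is empty the homophily effect has support 0.
    homophily = sum(1 for _, _, d in lw if matches(d, l_beta)) if b else 0
    return Fraction(hits, len(lw) - homophily)


# Hypothetical edges: (source attributes, edge attributes, target attributes).
F_GRAD = {"SEX": "F", "EDU": "Grad"}
M_GRAD = {"SEX": "M", "EDU": "Grad"}
M_COLL = {"SEX": "M", "EDU": "College"}
DATES = {"TYPE": "dates"}
edges = [(F_GRAD, DATES, M_GRAD)] * 4 + [(F_GRAD, DATES, M_COLL)] * 2

print(beta(F_GRAD, M_COLL, {"EDU"}))               # {'EDU'}
print(nhp(edges, F_GRAD, DATES, M_COLL, {"EDU"}))  # 1
print(nhp(edges, F_GRAD, DATES, M_GRAD, {"EDU"}))  # 2/3
```

For GR3 both sides carry EDU:Grad, so β is empty and nhp coincides with the confidence 4/6, in line with Remark 1.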

Remark 1. In the case of β = ∅, edges due to the homophily effect do not exist, so we define

$\mathrm{supp}(l \xrightarrow{w} l[\beta]) = 0$; consequently, nhp degenerates to conf. Hence, conf is a special case of nhp

where there is no homophily attribute. In the case of β ≠ ∅, nhp ≥ conf, so excluding the homophily

effect boosts the rank of a GR not expected from homophily. This is exactly what we want to achieve.
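
The inequality nhp ≥ conf stated in Remark 1 can be checked directly from Eqns. (3.3) and (3.6). The short derivation below is a sketch that assumes $\mathrm{supp}(l \xrightarrow{w} r) > 0$ and that the edges counted by $\mathrm{supp}(l \xrightarrow{w} r)$ and by $\mathrm{supp}(l \xrightarrow{w} l[\beta])$ are disjoint (r and l disagree on β); the shorthand a, b, c is ours and is introduced only for this illustration.

```latex
\begin{align*}
a &= \mathrm{supp}(l \xrightarrow{w} r), \qquad
b = \mathrm{supp}(l \wedge w), \qquad
c = \mathrm{supp}(l \xrightarrow{w} l[\beta]), \\
& a > 0, \quad c \ge 0, \quad a + c \le b
  \quad \text{(disjoint subsets of the edges satisfying } l \wedge w\text{)} \\
&\Longrightarrow\; 0 < a \le b - c \le b \\
&\Longrightarrow\; \mathrm{conf} = \frac{a}{b} \;\le\; \frac{a}{b - c} = \mathrm{nhp} \;\le\; 1 .
\end{align*}
```

Equality nhp = conf holds exactly when c = 0, i.e., when no edge satisfying l ∧ w goes to a node described by l[β].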

Theorem 1. Assume $\mathrm{supp}(l \xrightarrow{w} r) \neq 0$. (i) The denominator in Eqn. (3.6) is not zero. (ii) nhp ∈ [0, 1].

Proof. (i) If β = ∅, the denominator in Eqn. (3.6) is equal to supp(l ∧ w), which is not equal to 0

because $\mathrm{supp}(l \wedge w) \geq \mathrm{supp}(l \xrightarrow{w} r) > 0$.

Assume β ≠ ∅. Suppose $\mathrm{supp}(l \wedge w) - \mathrm{supp}(l \xrightarrow{w} l[\beta]) = 0$, i.e., $\mathrm{supp}(l \wedge w) = \mathrm{supp}(l \xrightarrow{w} l[\beta])$.

This implies that all edges satisfying l ∧ w go to the nodes covered by l[β] and no edge goes to the

nodes covered by r, i.e., $\mathrm{supp}(l \xrightarrow{w} r) = 0$. But this contradicts the assumption.

(ii) If β = ∅, the denominator in Eqn. (3.6) is equal to supp(l ∧ w), so nhp has a value in the

range [0, 1]. If β ≠ ∅, it suffices to note that the links accounted for by $\mathrm{supp}(l \xrightarrow{w} r)$ and the links

accounted for by $\mathrm{supp}(l \xrightarrow{w} l[\beta])$ are disjoint (because r and l disagree on β), and both are subsets

of those accounted for by supp(l ∧ w).

3.2.3 Top-k GRs Problem

We say that a GR $l \xrightarrow{w} r$ is trivial if all of the values in r are from homophily attributes and r ⊆ l.

A trivial GR is expected from the homophily principle, so we are only interested in non-trivial GRs.

Among the non-trivial GRs, some are interesting to users while some are not; we can use thresholds

on support and non-homophily preference to select the interesting ones. Furthermore, for two distinct GRs

$g_1: l_1 \xrightarrow{w_1} r_1$ and $g_2: l_2 \xrightarrow{w_2} r_2$, if $l_1 \subseteq l_2$, $w_1 \subseteq w_2$, and $r_1 = r_2$, we say that g1 is more general

than g2, and g2 is more special than g1. Intuitively, if g1 is more general than g2, g1 expresses a similar

tendency to g2 but covers more nodes on the LHS. In this case, if both g1 and g2 satisfy certain support

and non-homophily preference thresholds, g1 would make g2 redundant.

On account of the above discussion, finding the k most interesting GRs offers a brief and valu-

able overview of the entire social network. Hence, this problem is formulated as follows.

Problem 1. [Top-k GRs Problem] Given the homophily settings for attributes, a support threshold

minSupp, a non-homophily preference threshold minNhp, and an integer k, a non-trivial GR $l \xrightarrow{w} r$

is a top-k GR if the following three conditions hold:

• (1) $\mathrm{supp}(l \xrightarrow{w} r) \geq$ minSupp and $\mathrm{nhp}(l \xrightarrow{w} r) \geq$ minNhp;

• (2) no non-trivial GR is more general than $l \xrightarrow{w} r$ while satisfying (1);

• (3) no more than k − 1 non-trivial GRs have a higher rank while satisfying (1) and (2),

where the rank is measured by non-homophily preference, followed by support, followed by

the alphabetical order of GRs.

The objective is to mine the top-k GRs.
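
The three conditions of Problem 1 can be sketched as a post-hoc filter over a list of candidate GRs. This is only an illustration of the problem definition under simplifying assumptions: the candidates are assumed to be non-trivial with supp and nhp already computed, the `GR` record type, the `top_k` function and the four candidates are hypothetical, and the exhaustive filter is not the GRMiner algorithm of Section 3.4.

```python
from typing import NamedTuple


class GR(NamedTuple):
    l: frozenset  # LHS node descriptor as a set of (attribute, value) pairs
    w: frozenset  # edge descriptor
    r: frozenset  # RHS node descriptor
    supp: float
    nhp: float


def more_general(g1, g2):
    """g1 is more general than g2: l1 ⊆ l2, w1 ⊆ w2, r1 = r2, and g1 ≠ g2."""
    return (
        g1.l <= g2.l
        and g1.w <= g2.w
        and g1.r == g2.r
        and (g1.l, g1.w) != (g2.l, g2.w)
    )


def top_k(candidates, min_supp, min_nhp, k):
    # Condition (1): thresholds on support and non-homophily preference.
    ok = [g for g in candidates if g.supp >= min_supp and g.nhp >= min_nhp]
    # Condition (2): drop a GR if a more general GR also satisfies (1).
    kept = [g for g in ok if not any(more_general(h, g) for h in ok)]
    # Condition (3): rank by nhp, then support, then alphabetical order.
    kept.sort(key=lambda g: (-g.nhp, -g.supp, sorted(g.l), sorted(g.w), sorted(g.r)))
    return kept[:k]


fs = frozenset
g1 = GR(fs({("SEX", "F"), ("EDU", "Grad")}), fs(), fs({("EDU", "College")}), 0.2, 1.0)
g2 = GR(fs({("SEX", "F"), ("EDU", "Grad"), ("RACE", "Asian")}), fs(),
        fs({("EDU", "College")}), 0.1, 1.0)   # more special than g1
g3 = GR(fs({("SEX", "M")}), fs(), fs({("RACE", "Asian")}), 0.4, 0.5)
g4 = GR(fs({("SEX", "M")}), fs(), fs({("EDU", "Grad")}), 0.05, 0.9)  # below minSupp

result = top_k([g1, g2, g3, g4], min_supp=0.1, min_nhp=0.4, k=2)
print([g is x for g, x in zip(result, (g1, g3))])  # [True, True]
```

Here g2 is pruned by condition (2) because g1 is more general and also passes the thresholds, and g4 is removed by condition (1).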

Similar to setting the minSupp threshold for mining frequent itemsets, there is no

uniform way to determine the best thresholds for minSupp and minNhp and the value of k. The

setting of these parameters depends on the specific application and data, and is usually done by trial

and error. Typically, smaller thresholds of minSupp, minNhp and a larger k are used to generate

a superset of patterns to avoid losing interesting ones, and then the user can explore these patterns

to find more interesting ones and conduct more advanced analysis. In general, not all the top-k

GRs are necessarily very interesting or directly applicable; the top-k GRs are usually intermediate

nuggets rather than end results or knowledge, but they provide an important starting point for analysis.

3.2.4 Challenges

We first analyse the complexity of Problem 1. Let #AttrV denote the number of attributes on nodes

V and #AttrE denote the number of attributes on edges E. Since node attributes can occur on both

the LHS and the RHS of a GR, the total number of attributes is n = 2 × #AttrV + #AttrE. To find

the top-k GRs, in the worst case, we have

to traverse all the subsets of all the attributes, and for each attribute, we have to enumerate all the

values for it. For the sake of analysis, we assume all the attributes on nodes and edges have identical

domain size |A|. Therefore, the total number of GRs to search in Problem 1 is as many as:

$$\sum_{i=0}^{n} \binom{n}{i} |A|^{i} = (1 + |A|)^{n}. \tag{3.7}$$
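
The closed form in Eqn. (3.7) follows from the binomial theorem, and it can be checked numerically. In the sketch below the function name `search_space_size` and the example sizes (3 node attributes, 1 edge attribute, domain size 4) are hypothetical values chosen for illustration.

```python
from math import comb


def search_space_size(n, domain_size):
    """Left-hand side of Eqn. (3.7): choose i of the n attributes, then a value for each."""
    return sum(comb(n, i) * domain_size**i for i in range(n + 1))


num_node_attrs, num_edge_attrs, domain_size = 3, 1, 4  # hypothetical sizes
n = 2 * num_node_attrs + num_edge_attrs                # n = 7
print(search_space_size(n, domain_size), (1 + domain_size) ** n)  # 78125 78125
```

Even this small setting yields 78125 candidate GRs, which illustrates why the search space grows quickly with the number of attributes.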

One baseline algorithm for finding the top-k GRs $l \xrightarrow{w} r$ is to apply regular Apriori-like algorithms such

as [1] to find frequent sets l ∧ w and l ∧ w ∧ r above the minSupp threshold and then construct GRs

in a post-processing step using the minNhp threshold. However, strong social ties, with high nhp,

typically exist among small groups, i.e., with a relatively small support, and regular Apriori-like

algorithms do not work well for GRs with a small support, because there are too many frequent sets

when minSupp is small. Another issue is that frequent set mining usually requires collecting all

information in one table. For graph data, this means replicating the node information for every edge

adjacent to the node, and the size of this table is |E| × (2 × #AttrV + #AttrE). The term

|E| × 2 × #AttrV usually causes storage explosion and imposes a bottleneck for most graph algorithms,

especially for high-dimensional nodes with large #AttrV and densely connected graphs with large

|E|.

Another straightforward approach is to use a threshold for standard confidence (as defined in

Definition 3), minConf, and minSupp to mine all the GRs that satisfy these two criteria, then remove

the trivial (homophilic) GRs in a post-processing phase to get the final results. This approach has

the following drawbacks. First, as discussed in Section 3.2.2, the confidence metric favors GRs that

follow from the homophily principle so that the majority of the high-confidence GRs in the top-k

results are trivial (we will show this in Tables 3.2 and 3.3). Thus, many non-trivial and interesting

GRs are not returned because either their conf is less than minConf or they are not ranked within

the top k. As a result, this algorithm has to set a very small minConf and a very large k so that the

non-trivial GRs are returned before post-processing. By doing this, the efficiency becomes very poor

25

Page 37: Exploring Behavioral Data in Online Social Media with ...hongweil/files/Thesis_Hongwei_SFU.pdf · Exploring Behavioral Data in Online Social Media with Focus on User Connectivity

due to the computation of the huge number of trivial GRs. Second, the post-processing for the great

number of trivial GRs is another cost, which makes this approach rather worse.

An ideal algorithm examines only the necessary GRs and returns the top-k results in one phase. To achieve this and address the issues mentioned above, the key is to push the minNhp threshold, in addition to the minSupp threshold, as early as possible into the search. In addition, storing edge and node attribute information separately, without duplication, avoids the storage explosion described above. We first introduce the data structure for representing social networks with edge and node attribute information in Section 3.3. We then focus on a new search strategy for pushing the minNhp threshold and present the full implementation of our algorithm for mining top-k GRs in Section 3.4.

3.3 Data Structure

[Figure: a social network stored as three arrays. LArray has columns Al, Bl, Out, Ind; EArray has columns W, Ptr; RArray has columns ID, Ar, Br.]

Figure 3.2: Data structure: LArray, EArray and RArray

For the sake of illustration, let’s consider two node attributes A,B and one edge attribute W .

For each node attribute A, we use the symbol Al for the occurrence in LHS of a GR and use the

symbol Ar for the occurrence in RHS. Then, we shall store the node and edge information of social

networks separately as shown in Figure 3.2.

LArray contains the records for individuals that could occur in the LHS of GRs and RArray

contains the records for individuals that could occur in the RHS of GRs. Out is the out-degree of a

record and Ind is the starting position of its outgoing edges in EArray. The edge records of different LArray records are disjoint and can therefore be grouped contiguously, as in the figure. EArray contains one record for

each edge and Ptr is the pointer to the record for the destination node in RArray. We assume that

this structure is held in memory and use it to partition the data for counting the support for GRs. For

example, the first row in LArray represents the record 1 for LHS, which connects to the destination

records 2, 4 and 5 for RHS, found by the pointers Ptr kept in the entries [Ind, Ind + Out − 1] of

EArray. Note that RArray and LArray are for destinations and sources of edges (thus, not a subset

of each other) and will be sorted by the different attributes for RHS and LHS for counting support.

For this reason, RArray and LArray must be separately stored. When partitioning the data by each

attribute, we apply a linear sorting method, Counting Sort [23], to sort and aggregate to get the

support of each partition simultaneously. It sorts in O(N) time without any key comparisons.
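The layout described above can be sketched in Python as follows. This is a minimal illustration, not the thesis's C++ implementation: the class and function names are hypothetical, indexes are 0-based (the text uses 1-based record numbers), and the toy rows are made up so that the first LHS record reaches the second, fourth and fifth RHS records, mirroring the example in the text. The `counting_sort_partition` helper shows how one linear pass can group records by an attribute value and produce each partition's size at the same time, assuming values are small non-negative integers.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LRecord:
    attrs: Tuple[int, ...]  # LHS node attribute values, e.g. (Al, Bl)
    out: int                # out-degree of the node
    ind: int                # 0-based start of this record's edges in EArray


@dataclass
class ERecord:
    w: int    # edge attribute value
    ptr: int  # 0-based index of the destination record in RArray


def destinations(rec: LRecord, earray: List[ERecord]) -> List[int]:
    """RArray indexes reached from `rec`: entries [ind, ind + out - 1] of EArray."""
    return [e.ptr for e in earray[rec.ind: rec.ind + rec.out]]


def counting_sort_partition(values: List[int], domain_size: int) -> Tuple[List[int], List[int]]:
    """Group record indexes by attribute value without key comparisons.

    Returns (order, counts): `order` lists record indexes grouped by value,
    and counts[v] is the size of the partition for value v.
    """
    counts = [0] * domain_size
    for v in values:
        counts[v] += 1
    starts = [0] * domain_size
    for v in range(1, domain_size):
        starts[v] = starts[v - 1] + counts[v - 1]
    order = [0] * len(values)
    for idx, v in enumerate(values):
        order[starts[v]] = idx
        starts[v] += 1
    return order, counts


# Toy data (hypothetical values): two LHS records and their five edges.
larray = [LRecord((1, 2), out=3, ind=0), LRecord((2, 2), out=2, ind=3)]
earray = [ERecord(1, 1), ERecord(2, 3), ERecord(1, 4), ERecord(1, 2), ERecord(2, 4)]
```

Because each LHS record stores only `out` and `ind`, its node attributes are kept once regardless of how many edges it has.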

This compact data structure has the size |V| × (#AttrV + 2) + |E| × (#AttrE + 1) + |V| × #AttrV, which eliminates the bottleneck term |E| × 2 × #AttrV of the single-table representation mentioned in Section 3.2.4. The difference is usually large because #AttrV is typically much larger than #AttrE and a node typically connects to, or is connected from, multiple nodes. Even for a sparse network, the space requirement of the compact data structure is smaller, since nodes with zero out-degree or zero in-degree do not appear in LArray or RArray, respectively.
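To make the two size formulas concrete, the sketch below evaluates both for a given graph. The function names and the example figures (1,000 nodes, 20,000 edges, 6 node attributes, 1 edge attribute) are hypothetical and chosen only for illustration; the compact formula is used as stated, i.e., as if every node appeared in both LArray and RArray.

```python
def single_table_size(num_edges: int, attr_v: int, attr_e: int) -> int:
    """|E| x (2 x #AttrV + #AttrE): node attributes are replicated on every edge."""
    return num_edges * (2 * attr_v + attr_e)


def compact_size(num_nodes: int, num_edges: int, attr_v: int, attr_e: int) -> int:
    """|V| x (#AttrV + 2) + |E| x (#AttrE + 1) + |V| x #AttrV."""
    larray = num_nodes * (attr_v + 2)  # node attributes plus Out and Ind
    earray = num_edges * (attr_e + 1)  # edge attributes plus Ptr
    rarray = num_nodes * attr_v        # node attributes, as counted by the formula
    return larray + earray + rarray


# Hypothetical graph: 1,000 nodes, 20,000 edges, 6 node attributes, 1 edge attribute.
single = single_table_size(20_000, 6, 1)
compact = compact_size(1_000, 20_000, 6, 1)
```

With these made-up figures the single table needs 260,000 cells and the compact structure 54,000, and the gap widens as the average degree or #AttrV grows.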

3.4 Mining Top-k GRs

3.4.1 Pruning Strategies

To prune GRs using a minimum threshold on nhp (Definition 4), the challenge is that, as shown in Theorem 2(2,3), nhp is anti-monotone only in certain cases; in the remaining cases, adding a value for a homophily attribute to the RHS may increase or decrease nhp, so the traditional tree-based pattern enumeration cannot prune GRs using a threshold on nhp (see the discussion in Remark 2). To deal with this issue, we devise a new enumeration, subset-first depth-first enumeration with dynamic ordering of the homophily attributes (Section 3.4.2), which makes nhp anti-monotone in all cases (Theorem 3). This strategy allows us to prune GRs based on the threshold on nhp. First, the next theorem states the pruning properties of GRs.

Theorem 2. For a given GR $l \xrightarrow{w} r$, we can add an attribute value to l, r, or w of this GR to form a new GR with an additional attribute. (1) $\mathrm{supp}(l \xrightarrow{w} r)$ is not increased by adding an attribute value to l, r, or w. (2) If β ≠ ∅ for $l \xrightarrow{w} r$, $\mathrm{nhp}(l \xrightarrow{w} r)$ is not increased by adding a value to r. (3) If β = ∅, $\mathrm{nhp}(l \xrightarrow{w} r)$ is not increased by adding a value to r for a non-homophily attribute or for a homophily attribute not occurring in l.

Proof. (1) follows from the anti-monotonicity of support. By Definition 4,

$$\mathrm{nhp}(l \xrightarrow{w} r) = \frac{\mathrm{supp}(l \xrightarrow{w} r)}{\mathrm{supp}(l \wedge w) - \mathrm{supp}(l \xrightarrow{w} l[\beta])}.$$

Adding a value to r does not affect supp(l ∧ w) and, if β ≠ ∅, never increases $\mathrm{supp}(l \xrightarrow{w} l[\beta])$ or $\mathrm{supp}(l \xrightarrow{w} r)$. This shows (2). If β = ∅, adding a value to r for a non-homophily attribute, or for a homophily attribute not occurring in l, preserves β = ∅, thus $\mathrm{supp}(l \xrightarrow{w} l[\beta]) = 0$. Then (3) holds similarly as in (2).

Remark 2. Theorem 2(1) enables supp-based pruning of GRs, and Theorem 2(2,3) enables nhp-based pruning when expanding the RHS r of a GR in certain cases. The remaining case is adding to r a value for a homophily attribute that occurs in l when β = ∅. In this case, nhp is not anti-monotone. To see this, suppose that we add a value br for a homophily attribute B to r, where some value bl of Bl, bl ≠ br, has already occurred in l. Before the addition, β = ∅, thus $\mathrm{supp}(l \xrightarrow{w} l[\beta]) = 0$; after the addition, β ≠ ∅ (see Eqn. (3.4)), so $\mathrm{supp}(l \xrightarrow{w} l[\beta]) \neq 0$. This change may increase or decrease $\mathrm{nhp}(l \xrightarrow{w} r)$, making nhp not anti-monotone.
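The effect described in Remark 2 can be reproduced on a toy edge list. The Python sketch below is an illustration under stated assumptions, not the thesis's implementation: it omits edge attributes (w is empty), it reads β as the set of homophily attributes that occur on both sides of the GR with different values (the way Remark 2 and Eqn. (3.10) use it; Eqn. (3.4) remains the authoritative definition), and the data and helper names are made up.

```python
from typing import Dict, List, Set, Tuple

Node = Dict[str, str]
Edge = Tuple[Node, Node]  # (source node attributes, destination node attributes)


def matches(node: Node, cond: Dict[str, str]) -> bool:
    return all(node.get(a) == v for a, v in cond.items())


def supp(edges: List[Edge], l: Dict[str, str], r: Dict[str, str]) -> int:
    """Number of edges whose source satisfies l and whose destination satisfies r."""
    return sum(1 for s, d in edges if matches(s, l) and matches(d, r))


def nhp(edges: List[Edge], l: Dict[str, str], r: Dict[str, str], homophily: Set[str]) -> float:
    # Assumed reading of beta: homophily attributes on both sides with different values.
    beta = {a for a in r if a in homophily and a in l and l[a] != r[a]}
    l_beta = {a: l[a] for a in beta}
    homophily_supp = supp(edges, l, l_beta) if beta else 0
    # The denominator is assumed to be non-zero for the GRs of interest.
    return supp(edges, l, r) / (supp(edges, l, {}) - homophily_supp)


# Ten edges leaving nodes with B=b1 (hypothetical data).
src = {"B": "b1"}
edges: List[Edge] = (
    [(src, {"B": "b1", "G": "m"})] * 7
    + [(src, {"B": "b1", "G": "f"})] * 1
    + [(src, {"B": "b2", "G": "f"})] * 2
)

before = nhp(edges, {"B": "b1"}, {"G": "f"}, {"B"})            # beta is empty
after = nhp(edges, {"B": "b1"}, {"G": "f", "B": "b2"}, {"B"})  # beta = {B}
```

Here the numerator drops from 3 to 2 when B:b2 is added to the RHS, but the denominator drops from 10 to 2 because the eight homophilous edges are excluded, so nhp rises from 0.3 to 1.0.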

In the remainder of this section, we propose a careful order of enumerating GRs $l \xrightarrow{w} r$ so that GRs can be pruned based on nhp in all the cases of adding a value to r.

3.4.2 Subset-First Depth-First (SFDF) Enumeration

We propose to use a tree structure to represent all GRs. Each tree node represents a subset LWR (see Table 3.1 for the definition of L, W and R) and all the corresponding GRs $l \xrightarrow{w} r$. This tree structure is only a conceptual representation and is not stored in its entirety. The nodes of this tree are enumerated to ensure two properties:

• Property 1: Enumerate a subset LWR by appending attributes in the order of those in L, W, and R. This order enables the pruning in Theorem 2(1,2,3), where for a GR $l \xrightarrow{w} r$ the values for r are added after those for l and w.

• Property 2: Enumerate a subset L1W1R1 before any of its supersets L2W2R2, where L1 ⊆ L2, W1 ⊆ W2, and R1 ⊆ R2. This order ensures that the node for $l \xrightarrow{w} l[\beta]$ is enumerated before the node for $l \xrightarrow{w} r$ (because β is a subset of R); hence, $\mathrm{supp}(l \xrightarrow{w} l[\beta])$ has been computed before $\mathrm{nhp}(l \xrightarrow{w} r)$ is computed. This is necessary because the latter depends on the former.

The regular depth-first enumeration does not provide Property 2. The regular breadth-first enumeration (level order) meets both requirements but has to keep all nodes and their GRs at the same level, which imposes a bottleneck on memory.

We propose a novel strategy, called subset-first depth-first (SFDF)³, that (1) enumerates a subset before a superset, like the breadth-first enumeration, but is depth-first to avoid the memory bottleneck; and (2) ensures Properties 1 and 2 and that each subset LWR is enumerated at most once. One key idea of SFDF is to impose the following special reverse order of enumerating all attributes:

$$\tau : NH^r, H^r, W, NH^l, H^l \tag{3.8}$$

where $NH^l$ denotes the non-homophily attributes for LHS and $NH^r$ denotes the non-homophily attributes for RHS. Similarly, $H^l$ and $H^r$ denote the homophily attributes for LHS and RHS, respectively, and W denotes the edge attributes. We will illustrate the effect of this enumeration order shortly.

Figure 3.3 shows the SFDF enumeration of all subsets LWR, with the order indicated by the sequence numbers beside the nodes. For the sake of illustration, we assume that there are two node attributes, A and B, both of which are homophily attributes (the enumeration of non-homophily attributes is straightforward, as discussed below), and an edge attribute W. Therefore, according to Eqn. (3.8), $NH^r = NH^l = \emptyset$, $H^r$ = {Br, Ar}, $H^l$ = {Bl, Al}, and τ = (Br, Ar, W, Bl, Al).

At any tree node t, let label(t) denote the labeling attribute of t, let path(t) denote the attribute set LWR formed by the labels of the nodes on the path from the root to t, and let tail(t) denote the prefix of the list τ to the left of the attribute label(t). tail(t) is the set of unused attributes that can be used to expand path(t) in the subtree below t. If tail(t) ≠ ∅, then for each attribute in tail(t), in order, t has one child t′ labeled by that attribute. Note that by enumerating the attributes in tail(t) following the reverse order τ, we are in fact expanding the subset path(t) by appending the attributes in the normal order $H^l, NH^l, W, H^r, NH^r$, i.e., those for LHS, followed by those for edges, followed by those for RHS. This gives Property 1.

[Figure: enumeration tree over the attributes Br, Ar, W, Bl, Al, rooted at node 0 (nil), with nodes numbered 0 to 31 in enumeration order.]

Figure 3.3: Subset-First Depth-First enumeration, with dynamic ordering of homophily attributes (shown in the blue dashed box)

³ While we independently proposed this SFDF enumeration strategy, we later found an earlier work [67] that proposed a prefix extension tree for itemset enumeration based on a similar idea.

Let ti denote the tree node numbered i. Initially at the root t0, label(t0) = nil, path(t0) = ∅, and tail(t0) = τ. The root has five child nodes, t1, t2, t4, t8, t16, labeled Br, Ar, W, Bl, Al in that order. Next, the SFDF order enumerates t1. Since tail(t1) = ∅, t1 has no child. The next node enumerated is t2, labeled Ar, with tail(t2) = (Br), so t2 has one child t3 labeled Br. path(t3) = {Ar, Br}, which represents all GRs $l \xrightarrow{w} r$ with L = ∅, W = ∅, and R = {Ar, Br}. Similarly, t4, t5, t6, t7 are enumerated following this order.

At node t8, path(t8) = {Bl} and tail(t8) = (Br, Ar, W). For the first time, a homophily attribute, B, occurs in the LHS. This node represents the enumerated subset LWR where L = {Bl} and W = R = ∅. Note that β = ∅. t8 has three child nodes labeled Br, Ar, W. Following the above order, the subset BlBr would be enumerated before the subset BlAr, and the subset BlArBr would then be enumerated as a child node of BlAr (by adding Br). Because Al does not occur in the LHS while Bl does, this is exactly the case discussed in Remark 2, where a homophily attribute Bl has a value in l and adding a new value for Br to r changes β = ∅ to β ≠ ∅, causing the lack of anti-monotonicity of nhp.

To avoid this problem, at node t8 it helps if tail(t8) = (Ar, Br, W), so that we enumerate Ar before Br. Consequently, the subset BlAr (at t9) is enumerated before the subset BlBr (at t10), and the subset BlBrAr (at t11) is enumerated as a child node of BlBr instead of a child node of BlAr. This is defined by the dynamic order of tail(t) below.

Dynamic ordering of tail(t). At any node t with path(t) = LWR, let $NH^l, NH^r, W, H^l, H^r$ be the same as in Eqn. (3.8). Let $H^r_1$ and $H^r_2$ be the partitioning of $H^r$, where $H^r_1$ contains those Ar whose corresponding Al does not occur in path(t) and $H^r_2$ contains those Ar whose corresponding Al occurs in path(t). We dynamically order the attributes in tail(t) at a node t as follows:

$$NH^r, H^r_1, H^r_2, W, NH^l, H^l. \tag{3.9}$$

In other words, the homophily attributes in $H^r$ are dynamically ordered on the basis of whether their corresponding attributes were enumerated in the LHS at t. Consequently, the attributes in $H^r_2$ are added to path(t) before the attributes in $H^r_1$ if they both occur in path(t).

Consider the node t8 again. Recall that path(t8) = {Bl}. Then $H^r_1$ = {Ar} and $H^r_2$ = {Br} because Al was not enumerated in path(t8) and Bl was enumerated in path(t8). Therefore, tail(t8) is dynamically ordered as (Ar, Br, W), instead of the static order (Br, Ar, W). This order ensures that Br is added to path(t8) before Ar if both Br and Ar appear in the path, as shown by the path t8, t10, t11. On the path t4, t6, t7, Br is added to path(t4) after Ar. This does not contradict our order because no homophily attribute was enumerated in path(t4), i.e., $H^r_1$ = {Br, Ar} and $H^r_2$ = ∅. The next theorem shows that this dynamic order restores the anti-monotonicity of nhp.

Theorem 3. Assume that tail(t) is dynamically ordered at a node t as described above, and that g′ and g are non-trivial GRs where g′ is obtained from g by adding one or more values to the RHS of g. Then nhp(g′) ≤ nhp(g).

Proof. If β ≠ ∅ for g, Theorem 2(2) implies nhp(g′) ≤ nhp(g). We therefore assume β = ∅ for g. Let g′ be the result of adding a value b to the RHS of g. If b is a value for an attribute in $H^r_1$ or $NH^r$, Theorem 2(3) implies nhp(g′) ≤ nhp(g). So we assume that b is a value for a homophily attribute in $H^r_2$. In this case, according to the dynamic ordering of tail(t), the RHS of g contains only values for attributes in $H^r_2$, since the values for attributes in $H^r_1$ and $NH^r$ are added after those for attributes in $H^r_2$. Thereby, all attributes in the RHS of g are homophily attributes and occur in the LHS. Then the assumption β = ∅ implies that the values of these attributes are contained in the LHS of g; hence, g is a trivial GR, contradicting our assumption. This shows that b cannot be a value for a homophily attribute in $H^r_2$ if β = ∅. The case of adding more values to the RHS of g follows by repeating the above argument on g′.

The above enumeration order ensures that our depth-first traversal enumerates smaller subsets

LWR before enumerating larger ones, i.e., Property 2, adds the attributes for LHS before adding

the attributes for RHS, i.e., Property 1, and restores the anti-monotonicity of nhp, i.e., Theorem 3.

All these properties are essential for pruning GRs based on the threshold of nhp.
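The enumeration of this section can be sketched as a short recursive generator. The Python code below is a minimal illustration for the running example only (homophily attributes A and B, one edge attribute W, no non-homophily attributes); the names `dynamic_order`, `tail` and `sfdf` are hypothetical, and the sketch enumerates attribute subsets only, without any data partitioning or pruning.

```python
from typing import Iterator, List, Optional, Tuple

# Running example of Section 3.4.2, listed in the static order of Eqn. (3.8).
NH_R: List[str] = []
H_R: List[str] = ["Br", "Ar"]
W: List[str] = ["W"]
NH_L: List[str] = []
H_L: List[str] = ["Bl", "Al"]
LHS_OF = {"Ar": "Al", "Br": "Bl"}  # RHS homophily attribute -> its LHS counterpart


def dynamic_order(path: Tuple[str, ...]) -> List[str]:
    """Attribute order of Eqn. (3.9) for the given path."""
    h_r1 = [a for a in H_R if LHS_OF[a] not in path]  # LHS counterpart unused
    h_r2 = [a for a in H_R if LHS_OF[a] in path]      # LHS counterpart on the path
    return NH_R + h_r1 + h_r2 + W + NH_L + H_L


def tail(path: Tuple[str, ...], label: Optional[str]) -> List[str]:
    """Attributes to the left of `label` in the dynamically ordered list."""
    order = dynamic_order(path)
    return order if label is None else order[: order.index(label)]


def sfdf(path: Tuple[str, ...] = (), label: Optional[str] = None) -> Iterator[Tuple[str, ...]]:
    """Yield attribute subsets (as paths) in subset-first depth-first order."""
    yield path
    for attr in tail(path, label):
        yield from sfdf(path + (attr,), attr)


nodes = list(sfdf())
labels = [p[-1] for p in nodes[1:]]  # labels of t1, t2, ... in enumeration order
```

For this example the generator visits 32 subsets, and `labels` follows the node numbering of Figure 3.3; for instance, the node after `("Bl",)` is `("Bl", "Ar")` rather than `("Bl", "Br")`, which is the effect of the dynamic ordering.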

3.4.3 Computing Non-homophily Preference

A remaining issue is how to compute nhp at a node. Suppose that we are enumerating the current node t for GRs $l \xrightarrow{w} r$. In

$$\mathrm{nhp}(l \xrightarrow{w} r) = \frac{\mathrm{supp}(l \xrightarrow{w} r)}{\mathrm{supp}(l \wedge w) - \mathrm{supp}(l \xrightarrow{w} l[\beta])},$$

$\mathrm{supp}(l \xrightarrow{w} r)$ is computed at t, and supp(l ∧ w) was computed at an earlier node because the attribute set for l ∧ w is a subset of the attribute set for $l \xrightarrow{w} r$ (i.e., Property 2). In the following discussion, we consider $\mathrm{supp}(l \xrightarrow{w} l[\beta])$ and assume β ≠ ∅, thus $\mathrm{supp}(l \xrightarrow{w} l[\beta]) \neq 0$. Note that β ⊆ R. There are two cases:

Case 1: β ⊂ R. The node for $l \xrightarrow{w} l[\beta]$ was enumerated at an earlier node because the attribute set for $l \xrightarrow{w} l[\beta]$ is a subset of the attribute set for $l \xrightarrow{w} r$. Note that $l \xrightarrow{w} l[\beta]$ is a trivial GR, and its support can be easily computed.

Case 2: β = R. In this case, $\mathrm{supp}(l \xrightarrow{w} l[\beta])$ is computed at the current node t for $l \xrightarrow{w} r$. An example is a GR g at t27: (a2, b2) → (a1, b1), where a2, a1 are different values for attribute A, and b2, b1 are different values for B. So β = {Ar, Br} and

$$\mathrm{nhp}(g) = \frac{\mathrm{supp}((a_2, b_2) \to (a_1, b_1))}{\mathrm{supp}((a_2, b_2)) - \mathrm{supp}((a_2, b_2) \to (a_2, b_2))}. \tag{3.10}$$

If we generate (a2, b2) → (a2, b2) before generating any other GRs with (a2, b2) on the LHS, supp((a2, b2) → (a2, b2)) will be available when generating g. Enforcing this order only requires knowing the LHS of the current GR g, i.e., (a2, b2) in this example, and can therefore be easily implemented.

In both cases, $\mathrm{supp}(l \xrightarrow{w} l[\beta])$ is either already computed or can be computed at the same node as for $l \xrightarrow{w} r$. Therefore, $\mathrm{nhp}(l \xrightarrow{w} r)$ can be computed at the node for $l \xrightarrow{w} r$.

3.4.4 Algorithm

We now present the algorithm framework. Overall, it enumerates each attribute subset (LWR)

following the SFDF order, partitions the data stored in the format as in Figure 3.2 using the attribute

set recursively, and prunes further partitioning using the thresholds on supp and nhp.

Algorithm 1, GRMiner, gives the pseudo-code of our algorithm. The input of the algorithm is LArray, EArray and RArray. tail() returns the attributes (dimensions) that will be used to expand the attribute set LWR, similar to tail(t) in Section 3.4.2. Initially, tail(nil) returns all the attributes in the order in Eqn. (3.9). In our running example, tail(nil) = (Br, Ar, W, Bl, Al), where Br and Ar are in RArray, W is in EArray, and Bl and Al are in LArray.

At the current node t of the tree, data denotes the data partition generated by LWR at t.

Since the attributes in tail(t) are contained in the tables LArray, EArray, and RArray, we use three

recursive procedures RIGHT(data, Tail), EDGE(data, Tail) and LEFT(data, Tail) to partition

data, where Tail is a variable for tail(t). Initially, data is the entire tables LArray, EArray, and

RArray and Tail = tail(nil), at lines 2 - 4. Partitioning data by an attribute in tail(t) generates the

partitions for a child node created. These calls then search recursively deeper into the enumeration

tree, explained below. On return from all calls, top[k] contains the top-k GRs.

LEFT(data, Tail) partitions data using each dimension occurring both in Tail and in LArray (line 7). By abuse of notation, for each partition p, we also use p to denote the corresponding GR. supp(p) returns the support of p and p.Att returns the attributes on which p has been partitioned; p.Att corresponds to path(t) in Section 3.4.2. If supp(p) < minSupp, the procedure returns immediately (line 10); otherwise, p is recursively partitioned on the next three lines. The functions getRight(p) and getEdge(p) expand the partition p to the records in RArray and EArray, respectively.

Algorithm 1: GRMiner
Input : LArray, EArray and RArray
Output: top[k]

 1 Procedure Main()
 2     RIGHT(RArray, tail(nil));
 3     EDGE(EArray, tail(nil));
 4     LEFT(LArray, tail(nil));
 5     return top[k];
 6 Procedure LEFT(data, Tail)
 7     forall dimension d both in Tail and in LArray do
 8         forall partition p of data on dimension d do
 9             if supp(p) < minSupp then
10                 return;
11             RIGHT(getRight(p), tail(p.Att));
12             EDGE(getEdge(p), tail(p.Att));
13             LEFT(p, tail(p.Att));
14 Procedure EDGE(data, Tail)
15     forall dimension d both in Tail and in EArray do
16         forall partition p of data on dimension d do
17             if supp(p) < minSupp then
18                 return;
19             RIGHT(getRight(p), tail(p.Att));
20             EDGE(p, tail(p.Att));
21 Procedure RIGHT(data, Tail)
22     forall dimension d both in Tail and in RArray do
23         forall partition p of data on dimension d do
24             if supp(p) < minSupp OR nhp(p) < minNhp then
25                 return;
26             if p is a non-trivial GR and no more general GR than p found then
27                 update top[k] and minNhp if necessary;
28             RIGHT(p, tail(p.Att));

EDGE(data, Tail) is similar to LEFT(data, Tail) except that it partitions data by each dimension occurring both in Tail and in EArray, and recursively processes each partition p by the calls RIGHT() and EDGE().

RIGHT(data, Tail) partitions data by each dimension occurring both in Tail and in RArray, and recurs on each partition. Line 24 checks whether p meets the thresholds for support and non-homophily preference, and Line 26 checks whether p represents a non-trivial GR and whether a more general GR than the GR for p was generated before. Since our enumeration examines smaller subsets of attributes before examining larger subsets, once a GR passes this check, no later GR can be more general than it, so every GR in top[k] is a most general GR. Line 27 updates the top[k] list if the GR for p is among the top k GRs so far, and raises minNhp to the non-homophily preference of the lowest-ranked GR in top[k].
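The bookkeeping at line 27 can be sketched with a bounded min-heap whose smallest score becomes the new pruning threshold once k results are held. The Python class below is an assumption-laden illustration, not the thesis's code: the class name `TopK`, its methods, and the tie-breaking rule (a candidate that only ties the weakest held item is rejected) are all hypothetical.

```python
import heapq
from typing import List, Tuple


class TopK:
    """Keep the k highest-scoring items and expose a rising pruning threshold."""

    def __init__(self, k: int, min_score: float) -> None:
        self.k = k
        self.min_score = min_score  # plays the role of minNhp
        self._heap: List[Tuple[float, str]] = []  # min-heap ordered by score

    def offer(self, score: float, item: str) -> bool:
        """Try to add an item; return True if it entered the current top-k."""
        if score < self.min_score:
            return False
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (score, item))
        elif score > self._heap[0][0]:
            heapq.heapreplace(self._heap, (score, item))
        else:
            return False
        if len(self._heap) == self.k:
            # Once k items are held, anything below the weakest one can be pruned.
            self.min_score = max(self.min_score, self._heap[0][0])
        return True

    def ranked(self) -> List[Tuple[float, str]]:
        """Held items from highest to lowest score."""
        return sorted(self._heap, reverse=True)
```

For example, with `k = 2` and an initial threshold of 0.5, offering scores 0.6, 0.7 and then 0.65 raises the threshold first to 0.6 and then to 0.65, so a later candidate scoring 0.62 is rejected without further work.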

Corollary 1. (1) A non-trivial GR is examined by Algorithm 1 only if it passes both minSupp and

minNhp. (2) top[k] returned by GRMiner contains the top-k GRs.

The above corollary holds because a partition p is considered as a non-trivial GR only if p passes both thresholds (Line 26), and Algorithm 1 never misses any non-trivial general GR that satisfies the thresholds, due to the carefully designed enumeration and pruning strategies. The complexity of this algorithm is proportional to the number of GRs examined. (1) implies that no time is spent on examining any non-trivial GRs that do not meet the thresholds minSupp and minNhp, thanks to the checks at lines 9, 17, 24, and Theorem 3. Typically, far fewer GRs are examined because minNhp is dynamically updated to the smallest non-homophily preference of the current top-k GRs (line 27). We will examine this effect of minNhp on real-life data sets in Section 3.5.

3.5 Experimental Evaluation

We evaluated the GRMiner algorithm on real-life data on a machine running CentOS 6.4 with Intel 8-core 2.53 GHz processors and 12 GB of RAM. The programs were written in C++.

3.5.1 Data Sets

We used two public real-world data sets, Pokec Social Network data⁴ and DBLP co-authorship data⁵, because the domains of these data sets are easy to understand, which is essential for interestingness studies.

⁴ http://snap.stanford.edu/data/soc-pokec.html
⁵ http://dblp.uni-trier.de/xml/

Pokec Social Network Data. Pokec is the most popular online social network service in Slovakia for discovering, chatting and dating with online friends. This data set contains anonymized users with profile data and directed friendships between users. We extracted the 6 most important node attributes: Gender (G,3), Age (A,11), Region (R,188), Education (E,10), What-Looking-For (L,11), and Marital Status (S,7), where the letter and number in parentheses are the abbreviation and domain size of an attribute. We specify A, R, E, L as homophily attributes. While all attributes have drop-down lists for choosing their values, E, L, S can also be filled with free text. We used the values from the drop-down list whenever they were chosen; otherwise, we used the user-filled text subject to the following preprocessing, in order: (1) Remove all characters except letters and apply standard IR pre-processing to the filled text. (2) For the words that occur in more than 200 user profiles, replace them by their English synonyms and mark the other words as “invalid”. (3) Use the highest level for E (for example, keep “Master” if both “Bachelor” and “Master” are filled); for L and S, use the word with the highest frequency. (4) Keep only user profiles containing no “invalid” value. The final induced graph has 1,436,515 (87.98%) users and 21,078,140 (68.83%) directed edges. In addition, we discretized the domain of Age into “0-6”, “7-13”, “14-17”, “18-24”, “25-34”, “35-44”, “45-54”, “55-64”, “65-79”, and “80 or older”.
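The age discretization above amounts to a lookup over sorted bin boundaries. A minimal Python sketch follows; the bin labels are taken from the text, while the function name and the choice of a binary search are illustrative assumptions about how one might implement it.

```python
from bisect import bisect_left

# Inclusive upper ends of all bins except the open-ended last one.
AGE_UPPER_BOUNDS = [6, 13, 17, 24, 34, 44, 54, 64, 79]
AGE_LABELS = ["0-6", "7-13", "14-17", "18-24", "25-34",
              "35-44", "45-54", "55-64", "65-79", "80 or older"]


def age_group(age: int) -> str:
    """Map an age in years to its discretized group label."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_left returns the index of the first upper bound >= age.
    return AGE_LABELS[bisect_left(AGE_UPPER_BOUNDS, age)]
```

Ages above 79 fall past the last bound and therefore map to the final label.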

DBLP Data. This is the co-authorship DBLP data set used in [127]. It contains 28,702 authors and 66,832 directed co-author relationships (we replace each undirected edge with two directed edges in opposite directions). Each author has two node attributes: Area (A), with 4 values DB, DM, AI and IR, and Productivity (P), with 4 values Poor, Fair, Good and Excellent. We use exactly the same criteria as in [127] to discretize the values of the two attributes. Admittedly, an author may belong to multiple areas; we select only the one among the four in which she/he publishes most. Note that we use exactly the same data set as in [127] for our experimental study and do not discuss the reasonableness of the criteria for discretizing the values. We specify Area as a homophily attribute since authors in the same area tend to collaborate, and we specify Productivity as a non-homophily attribute, since it is common that students and professors are co-authors but students generally have far fewer publications than professors. We construct one edge attribute, Collaboration Strength (S), with three domain values: occasional (f = 1), moderate (2 ≤ f < 5), and often (f ≥ 5), where f is the number of papers co-authored by the two authors at the ends of an edge.
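The edge construction for the DBLP graph can be sketched as follows. The thresholds come from the text; the function names, the input format (a dictionary from an unordered author pair to the number of co-authored papers) and the sample author identifiers are hypothetical.

```python
from typing import Dict, List, Tuple


def collaboration_strength(f: int) -> str:
    """Discretize the number of co-authored papers f into the edge attribute S."""
    if f < 1:
        raise ValueError("an edge requires at least one co-authored paper")
    if f == 1:
        return "occasional"
    if f < 5:
        return "moderate"
    return "often"


def to_directed(coauthor_counts: Dict[Tuple[str, str], int]) -> List[Tuple[str, str, str]]:
    """Expand each undirected co-author pair into two directed, labeled edges."""
    edges: List[Tuple[str, str, str]] = []
    for (u, v), f in coauthor_counts.items():
        strength = collaboration_strength(f)
        edges.append((u, v, strength))
        edges.append((v, u, strength))
    return edges
```

Both directions of a pair carry the same strength label, since f is a property of the pair rather than of either endpoint.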

We evaluated the interestingness of GRs in Sections 3.5.2 and 3.5.3, and evaluated the efficiency

of the GRMiner algorithm in Section 3.5.4.

3.5.2 Interestingness Study for Pokec Data

One of our claims is that the proposed non-homophily preference metric (i.e., nhp) helps to identify interesting social ties beyond the well-known homophily principle. We evaluate this claim by comparing the top-k GRs ranked by nhp with the top-k GRs ranked by the standard confidence, conf. Note that when applying conf, the homophily effect is not excluded. As we mentioned in Section 3.2.3, the setting of minSupp, minNhp and k is data and application specific and is usually done by trial and error. For the interestingness study here, we set minSupp = 0.1% (i.e., absolute minSupp = 21078), minNhp and minConf at 50%, and k = 300, i.e., relatively small thresholds and a large k, to allow more strong GRs to be returned. Table 3.2 shows the top-5 GRs ranked by nhp (labeled P1 to P5) and the top-5 GRs ranked by conf, plus one lower-ranked GR by nhp (the last row). Four of the top-5 GRs ranked by conf are trivially expected from the homophily principle, as both LHS and RHS contain the same value; this trend continues further down the list (not shown here). This suggests that the conf metric fails to find interesting relationships beyond what is known from the homophily effect. In contrast, the GRs ranked by nhp, i.e., P1-P5 and P207, tend to provide more insights. The conf of these GRs is included for comparison. These GRs are found because their nhp is high, even though their conf is low. Note that the proportion of data covered by a GR is captured by supp. We pick P2, P5, and P207 to discuss in detail; the other GRs can be interpreted in a similar way.

Table 3.2: Comparison of top GRs ranked by nhp and conf for Pokec data set

| Ranked by nhp | nhp | supp | conf | Ranked by conf | conf | supp |
|---|---|---|---|---|---|---|
| P1: (L:Chat) → (L:Good Friend) | 69.5% | 649723 | 30.9% | (R:27) → (R:27) | 72.2% | 250930 |
| P2: (E:Basic) → (E:Secondary) | 68.7% | 682715 | 15.4% | (R:24) → (R:24) | 66.1% | 197374 |
| P3: (E:Preschool) → (E:Basic) | 66.1% | 54765 | 30.4% | (R:32) → (R:32) | 65.1% | 143219 |
| P4: (E:Hardly Any) → (E:Basic) | 65% | 34099 | 30.7% | (R:10) → (R:10) | 65% | 279623 |
| P5: (L:Sexual Partner) → (G:Female) | 64.7% | 468012 | 64.7% | (L:Sexual Partner) → (G:Female) | 64.7% | 468012 |
| P207: (G:Male, A:25-34) → (A:18-24) | 50.8% | 593785 | 33.9% | | | |

P2: (E:Basic) → (E:Secondary). This GR indicates that people with Basic education, when not partnering with people of the same education level, preferred those with Secondary education in 68.7% of the cases. Before concluding that P2 can be leveraged to recommend people with Secondary education as dating partners to those with Basic education, note that Training is the education level closest to Basic, so homophily on Education would suggest that Training, rather than Secondary, is the more popular choice among people with Basic education; in this sense P2 is less expected. Further examination of the data reveals that the proportion of Secondary is 19.54% and the proportion of Training is only 1.9%, which is probably the reason for the high nhp of this GR.
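The gap between nhp and conf for P2 can be unpacked with a little arithmetic. The sketch below assumes that conf is supp of the GR divided by supp(l ∧ w), which is consistent with nhp reducing to conf when β = ∅, and it uses the rounded values reported in Table 3.2, so the results are approximate. The function name is hypothetical.

```python
from typing import Tuple


def implied_supports(supp: int, nhp: float, conf: float) -> Tuple[float, float, float]:
    """Recover approximate supports implied by a GR's reported supp, nhp and conf.

    Returns (supp(l and w), supp(l -> l[beta]), homophily share of l's edges).
    """
    lhs_supp = supp / conf                # supp(l and w), from conf
    non_homophily = supp / nhp            # supp(l and w) - supp(l -> l[beta]), from nhp
    homophily = lhs_supp - non_homophily  # supp(l -> l[beta])
    return lhs_supp, homophily, homophily / lhs_supp


# P2 in Table 3.2: supp = 682715, nhp = 68.7%, conf = 15.4% (rounded values).
lhs_supp, homophily, share = implied_supports(682715, 0.687, 0.154)
```

Under these assumptions, roughly 78% of the edges leaving users with Basic education go to other users with Basic education, which is why conf for P2 is low while nhp, computed over the remaining edges only, is high.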

P5: (L:Sexual Partner) → (G:Female). For this GR, nhp degenerates into conf because β = ∅ (no homophily attribute occurs on both sides). This GR suggests that for people describing themselves as looking for sexual partners, 64.7% of their partners are female. Starting with this GR and wondering whether gender has any impact on this behavior, we formed the following two hypotheses by varying P5, and queried their nhp and supp from the data:

(G:Male, L:Sexual Partner) → (G:Female)    nhp = 68.1%; supp = 392652
(G:Female, L:Sexual Partner) → (G:Male)    nhp = 48.8%; supp = 71699

This pair suggests a big difference between males and females in their preference for opposite-sex partners when looking for sexual partners, which could be useful in demographic and social research. Without first finding P5, it would be difficult to find this difference from the collection of GRs.

Table 3.3: Comparison of top GRs ranked by nhp and conf for DBLP data set

| Ranked by nhp | nhp | supp | conf | Ranked by conf | conf | supp |
|---|---|---|---|---|---|---|
| D1: (A:AI) → (P:Poor) | 74.3% | 31330 | 74.3% | (A:AI) → (A:AI) | 88.8% | 37458 |
| D2: (A:DB) $\xrightarrow{\text{often}}$ (A:DM) | 71.5% | 98 | 6.98% | (A:DB) → (A:DB) | 88.7% | 44980 |
| D3: (P:Poor) → (P:Poor) | 70.6% | 63174 | 70.6% | (A:IR) → (A:IR) | 75.9% | 16020 |
| D4: (P:Excellent) → (A:DB) | 68.1% | 2744 | 68.1% | (A:AI) → (P:Poor) | 74.3% | 31330 |
| D5: (A:IR) → (P:Poor) | 68.1% | 14368 | 68.1% | (A:DM) → (A:DM) | 72.3% | 14232 |
| D16: (A:AI, P:Good) → (A:DM) | 55.2% | 272 | 11.6% | | | |

P207: (G:Male, A:25-34) → (A:18-24). Again, we form hypotheses from the seed P207. We replace Male with Female on the LHS and get nhp = 32.8% and supp = 204780, which suggests that women preferred younger partners much less than men did. The next two variations show that this difference is even bigger for partners of the opposite sex:

(G:Male, A:25-34) → (G:Female, A:18-24)    nhp = 39.1%; supp = 456201
(G:Female, A:25-34) → (G:Male, A:18-24)    nhp = 12.8%; supp = 80070

These results remind us that when considering the age factor in recommender models, males and females should be treated differently.

3.5.3 Interestingness Study for DBLP Data

For DBLP data, we set minSupp = 0.1% (i.e., absolute minSupp = 67), minNhp and minConf at 50%, and k = 20. Table 3.3 shows the top GRs ranked by nhp (labeled D1 to D5, plus D16) and by conf. Similar to the study on the Pokec data, the top GRs ranked by nhp are more interesting than those ranked by conf. Recall that Area (A) is a homophily attribute and Productivity (P) is not.

D1 & D3 & D5: On the surface, D1, D3, and D5 suggest a preference for authors with Poor productivity. This is interesting as it contradicts common sense. A quick check of the data (by examining the value distribution of the attribute) shows that 91.18% of the authors have the value Poor for P, because many authors are students and most co-authorship is between supervisors and students.


D2: (A:DB) $\xrightarrow{\text{often}}$ (A:DM). D2 suggests that authors in the DB area often collaborate with those in the DM area when collaborating with those not in their own area. D16 is a similar pattern for authors in the AI area. In fact, DM has the smallest proportion among all areas. Therefore, these GRs represent a true preference for DM rather than an effect of data skewness. A possible reason is that DM is an interdisciplinary field that intersects databases and machine learning (a subarea of AI).

Remark 3. Though we have seen that the top-k GRs can be useful in many applications, such as user behavior analysis, social research, and recommendation, it would be naive to expect that all top-k GRs ranked by an objective measure are “pure gold”. Finding top-k GRs typically serves as the entry point in pattern mining. In the above case studies, the human analyst starts with the top-k GRs found, forms new hypotheses by varying the GRs found, and compares these hypotheses with each other and with the data distribution. This process can be applied to the new hypotheses recursively. This cycle of hypothesis formulation and hypothesis comparison often leads to new insights into the behaviors of different groups of actors or to an explanation of the presence of a GR. Unlike manual probing of a data set, top-k GRs provide an entry point to this cycle by filtering out many uninteresting and trivial patterns.

3.5.4 Efficiency of Algorithms

Our algorithm finished running on the DBLP data set in no more than 0.483 seconds for all pa-

rameter settings. Therefore, our study below focuses on the Pokec data, which is much larger than

the DBLP data. GRMiner(k) denotes the algorithm that pushes all the constraints of minSupp, minNhp, top-k, and generality of GRs to prune the search space, as described in Section 3.5.4. GRMiner pushes all constraints except for the top-k constraint. The difference between the two shows the effectiveness of dynamically upgrading minNhp to that of the top-k GRs.

We consider two baseline solutions. The first stores the node and edge attribute information in a single table and applies the BUC algorithm [8] to mine the combinations of attribute values above the threshold minSupp. We denote this baseline by BL1. The second baseline, BL2, is similar to BL1 but works with the node and edge attribute information stored separately in three tables. Both baselines prune the search space using the anti-monotonicity of support, but not minNhp, and find the top-k GRs in a post-processing step.

Unless otherwise stated, we consider the four node attributes with the largest domain sizes, i.e., Age, Region, Education and What-looking-for, for examining various parameter settings, so the dimensionality of the search space for GRs is 8. We set the ranges of (absolute) minSupp, minNhp, and k to [2, 10000], [0%, 100%], and [1, 10000], respectively, with the default settings 50, 50%, and 100. Figure 3.4 summarizes the runtime comparison of all algorithms.

minSupp. Figure 3.4a presents runtime vs minSupp. For a small minSupp, the runtime of BL1 and BL2 increases quickly while the runtime of GRMiner(k) and GRMiner remains relatively stable, even when minSupp is reduced to 2. The efficiency of GRMiner(k) and GRMiner in the case of a small minSupp comes from the fact that these algorithms prune the search space using minNhp. This is a huge advantage because a small minSupp is often required for finding GRs with a high non-homophily preference, which typically exist between small user groups.

[Figure 3.4 has four panels comparing GRMiner(k), GRMiner, BL2, and BL1, with time in seconds on the vertical axis: (a) Time vs minSupp (absolute value); (b) Time vs minNhp (%); (c) Time vs k and minNhp (%); (d) Time vs dimensionality.]

Figure 3.4: Runtime for mining GRs for Pokec data

minNhp and Top-k. Figure 3.4b studies the effect of minNhp. BL1 and BL2 do not benefit from a larger minNhp since they employ only minSupp for pruning. GRMiner(k) and GRMiner are significantly faster thanks to the minNhp based pruning. For a small minNhp, GRMiner(k) is faster than GRMiner because it dynamically upgrades minNhp to the smallest nhp of the top-k GRs found. Figure 3.4c examines the joint effect of k and minNhp on GRMiner(k). As long as one of the two constraints is tight, i.e., a small k or a large minNhp, the pruning is effective. With a small k, the smallest nhp of the top-k GRs is likely high, so the upgraded minNhp has a similar effect to having a large user-specified minNhp.
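The idea of dynamically upgrading the threshold to the smallest score among the current top-k results can be illustrated with a min-heap. The following Python code is a minimal sketch under stated assumptions, not the GRMiner(k) implementation: the class `TopKThreshold` and its methods are hypothetical names, and the scores fed to it are the nhp values listed in Table 3.3.

```python
import heapq


class TopKThreshold:
    """Keep the k best-scoring items and expose the current pruning threshold."""

    def __init__(self, k, min_score):
        self.k = k
        self.min_score = min_score  # user-specified threshold (role of minNhp)
        self.heap = []              # min-heap of (score, item); heap[0] is the weakest kept item

    def threshold(self):
        # Once k items are held, the weakest of them is the effective threshold.
        if len(self.heap) == self.k:
            return max(self.min_score, self.heap[0][0])
        return self.min_score

    def offer(self, score, item):
        """Try to add a candidate; return True if it is kept."""
        if score < self.min_score:
            return False
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (score, item))
            return True
        if score > self.heap[0][0]:
            heapq.heapreplace(self.heap, (score, item))
            return True
        return False

    def results(self):
        return sorted(self.heap, reverse=True)


# nhp values taken from Table 3.3, offered in an arbitrary order.
top = TopKThreshold(k=3, min_score=0.5)
for nhp, name in [(0.552, "D16"), (0.681, "D4"), (0.743, "D1"),
                  (0.706, "D3"), (0.715, "D2")]:
    top.offer(nhp, name)

print(top.threshold())                      # threshold upgraded from 0.5
print([name for _, name in top.results()])  # names of the current top-3
```

With k = 3 the threshold rises from the user-specified 0.5 to the weakest kept score as soon as three items are held, so a later candidate with nhp = 68.1% (such as D5) is rejected even though it exceeds the original threshold.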

Dimensionality. Figure 3.4d shows the effect of the dimensionality 2l, when the first l node attributes listed in Section 3.5.1 are included and l varies from 2 to 6. All other parameters are set to their default settings. As the data has more node attributes, the runtime of the two baselines increases exponentially, whereas that of GRMiner(k) and GRMiner increases much more slowly. This is because, for the baselines, more attributes bring exponential growth in the number of attribute combinations, whereas for GRMiner(k) and GRMiner, the more attributes can occur on the RHS, the more room there is for minNhp pruning according to Theorem 3.


3.6 Summary and Extensions

Given the homophily observed on social interactions, we considered the problem of mining interest-

ing interaction patterns that are not expected from homophily by excluding the impact of homophily

from the interestingness metrics of social ties. We motivated and formulated this problem as min-

ing top-k group relationships from a social network with respect to a specification of homophily

attributes. We presented an efficient solution to this problem with a focus on pushing the new in-

terestingness metric to prune the search space. We consider finding top-k group relationships as

the start of analysis, not the end. However, this starting step is important as it provides the user
with leads to start with. Our empirical study on two real data sets demonstrated the

potential of this approach in finding interesting social patterns.

While non-homophily preference (nhp) is defined for the problem of mining GRs beyond ho-

mophily in this work, the algorithm framework in Section 3.4.4 can be extended to different inter-

estingness metrics to solve different tasks.

The support-confidence metric has some drawbacks, and several alternatives have been suggested in the literature to address them. See [7, 60] for a discussion and motivation of such alternatives. The following are several examples of such alternative metrics after being adapted to a GR $l \xrightarrow{w} r$:

$$\mathrm{laplace}(l \xrightarrow{w} r) = \frac{\mathrm{supp}(l \xrightarrow{w} r) + 1}{\mathrm{supp}(l \wedge w) + k}, \tag{3.11}$$

where $k$ is an integer greater than 1.

$$\mathrm{gain}(l \xrightarrow{w} r) = \mathrm{supp}(l \xrightarrow{w} r) - \theta \times \mathrm{supp}(l \wedge w), \tag{3.12}$$

where $\theta$ is a fractional constant between 0 and 1.

$$\mathrm{Piatetsky\_Shapiro}(l \xrightarrow{w} r) = \mathrm{supp}(l \xrightarrow{w} r) - \frac{\mathrm{supp}(l \wedge w)\,\mathrm{supp}(r)}{|E|}, \tag{3.13}$$

$$\mathrm{conviction}(l \xrightarrow{w} r) = \frac{|E| - \mathrm{supp}(r)}{|E|\,(1 - \mathrm{conf}(l \xrightarrow{w} r))}, \tag{3.14}$$

$$\mathrm{lift}(l \xrightarrow{w} r) = \frac{|E|\,\mathrm{conf}(l \xrightarrow{w} r)}{\mathrm{supp}(r)}. \tag{3.15}$$

For example, a GR $l \xrightarrow{w} r$ may have a high confidence only because the relevant attribute value on the RHS is frequent among all the edges, i.e., $\mathrm{supp}(r)$, or equivalently $\mathrm{conf}(\emptyset \xrightarrow{w} r) = \mathrm{supp}(r)/|E|$, is high. One example is the GR D1 found in Section 3.5.3, which does not represent an interesting pattern. The lift metric, defined in Eqn. (3.15), can reduce the influence of this data skewness.

To adapt our algorithms for mining interesting GRs to these alternative metrics, a key observation is that all the above alternative metrics are defined using three supports, namely, $\mathrm{supp}(l \xrightarrow{w} r)$, $\mathrm{supp}(l \wedge w)$, and $\mathrm{supp}(r)$, and these supports are easily computed. Therefore, in principle, the algorithm for mining top-k GRs presented in this work can be applied as well if nhp is replaced with one of the above alternative metrics. In addition, for laplace and gain, the anti-monotonicity remains valid (proof omitted). This means that, similar to the regular confidence based pruning, candidate GRs can be pruned based on a given threshold on laplace or gain. For Piatetsky_Shapiro, conviction, and lift, the corresponding pruning is not available because these metrics are not anti-monotone with respect to the RHS r, but the support based pruning is still applicable. For such metrics, the top-k GRs have to be found in a post-processing step after finding all the GRs satisfying the threshold on support.
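As an illustration of the observation that all of these metrics are functions of three supports and the number of edges, the following Python code evaluates Eqns. (3.11) to (3.15) directly. It is a minimal sketch: the function name `gr_metrics`, its parameter names, and the counts in the example are hypothetical and are not taken from the data sets studied here.

```python
import math


def gr_metrics(supp_lwr, supp_lw, supp_r, num_edges, k=2, theta=0.5):
    """Evaluate the alternative metrics of Eqns. (3.11)-(3.15).

    supp_lwr  : supp(l -w-> r)
    supp_lw   : supp(l AND w)
    supp_r    : supp(r)
    num_edges : |E|
    k, theta  : constants of laplace and gain (illustrative defaults)
    """
    conf = supp_lwr / supp_lw
    metrics = {
        "conf": conf,
        "laplace": (supp_lwr + 1) / (supp_lw + k),
        "gain": supp_lwr - theta * supp_lw,
        "piatetsky_shapiro": supp_lwr - supp_lw * supp_r / num_edges,
        "lift": num_edges * conf / supp_r,
    }
    # conviction divides by (1 - conf), so it is unbounded when conf = 1.
    if conf == 1.0:
        metrics["conviction"] = math.inf
    else:
        metrics["conviction"] = (num_edges - supp_r) / (num_edges * (1.0 - conf))
    return metrics


# Hypothetical counts: the RHS value r occurs on 80% of all edges.
example = gr_metrics(supp_lwr=80, supp_lw=100, supp_r=800, num_edges=1000)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

In this made-up example the confidence is 80%, yet lift is 1 and Piatetsky_Shapiro is 0, because the RHS value is equally frequent among all edges. This is the kind of skewness effect described above for D1.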


Chapter 4

Personalized Trip Recommendation Meets Real-world Constraints

4.1 Motivations and Contributions

In LBSN services such as Yelp and Foursquare, users are able to “check in” at a POI, such as a restaurant, museum, or park, via their mobile devices. A user may rate and comment on a POI after visiting it, and other users may consider those ratings and comments when selecting the POIs for their own visits at a later time. The availability of such rating data and LBSN services opens up an array of new research problems in both academia and industry, such as user behavior analysis, movement pattern study [21, 59], and various real-world applications [22, 118, 132]. Among them, trip recommendation [53, 114] is a hot topic, which has value in multiple aspects such as personal entertainment, the economics of a city, and society building. The majority of current trip recommendation or route planning systems focus on shortest or least-cost paths, or explore points of interest (POIs) that are popular or geographically close [133, 116]. The popular travel planner, Google Trips, only suggests day plans traversing famous places or user-selected POIs, and it is not able to respond to a user’s detailed requirements.

We are interested in the personalized trip recommendation problem in which a user travels to a

new region (e.g., on a business trip to a new city) and wants to visit several POIs within a limited

amount of time. The goal is to recommend a trip route visiting several POIs according to not only the

temporal-spatial constraints (more details shortly), but also the user specific preferences on POIs.

The problem becomes more interesting and challenging when the following aspects are considered.

• (Personalization) First, while users have their own interests, explicitly soliciting this information does not work in large-scale applications because the user often does not know what POIs are available and where they are. A preferred solution is to model user preferences by learning from the historical rating and check-in behaviors of users and their peers, and to predict the user’s preferences on unvisited POIs.

• (Order and spatial constraints of POIs) Second, traditional POI recommendation recommends individual POIs with the highest scores, but such POIs may not form a feasible trip due to the spatial and time constraints. For example, as illustrated in Figure 4.1, though A (a national park) and B (a famous restaurant) have the highest scores individually, it is not feasible to visit both A and B, and end at the specified destination, due to the user’s time constraint and the long travel distance between the two. In this case, recommending A and C (another park) is likely more suitable, even if C may have a slightly lower score than B.

[Figure 4.1: Example of trip recommendation. The figure shows a source, a destination, and four POIs: A (park), B (restaurant), C (park), and D (shop).]

• (POI availability and uncertain traveling time) Third, traditional trip recommendation assumes that POIs are available at any time and that the traveling time between two POIs is known in advance. In practice, a POI may be available only at certain times (say, due to opening and closing hours), and the traveling time is uncertain due to traffic conditions at the time of travel. As a result, whether a POI can be visited depends on its available time and on the predictability of the time needed to travel to it. If the timeliness of finishing the trip is important to the user, a trip with a more predictable traveling time would be preferred. For example, the user may give up visiting one more POI in order to ensure a high probability of visiting another, more preferred POI or of arriving at the specified destination on time.

• (POI diversity) Fourth, the diversity of POIs included in a trip also affects user satisfaction, since users are usually interested in different types of attractions or expect a variety of activities, and therefore want to explore multiple categories of POIs during the trip. A trip that consists of too few categories of POIs (or even a single one) is boring. The category of a POI could be museum, park, shop, restaurant, etc. For example, if the user wants to visit at least two categories of POIs in the case of Figure 4.1, the trip with the green line cannot meet the user’s needs while the trip with the red line is feasible. Similar arguments are also made by a few previous works, such as [3, 34]. However, they either expect the user to manually select a POI from each desired category or assume a fixed order on the categories of POIs, and these restrictions are too strong.

• (Large search space) Finally, the POI availability and uncertain traveling time imply that each order of visiting a set of POIs may have a different consequence; thus, a brute-force search of all candidate trips is prohibitive. For example, with 150 POIs in total, the number of trips that consist of 5 POIs can reach billions (i.e., 150!/(150 - 5)!). Most of these candidate trips do not respect the POI availability, do not match the user’s preferences, or cannot be finished within a given time limit. A strategy that prunes such infeasible and non-optimal trips based on user preferences, POI availability, and traveling time uncertainty is essential for scaling a solution to large applications.
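The size of this search space can be checked with a short calculation. The Python snippet below is only an arithmetic illustration; the variable names are ours.

```python
import math

num_pois, trip_len = 150, 5

# Ordered selections of 5 distinct POIs out of 150: 150! / (150 - 5)!
ordered_trips = math.perm(num_pois, trip_len)

# For comparison, the number of unordered POI sets of the same size.
unordered_sets = math.comb(num_pois, trip_len)

print(ordered_trips)   # 70992003600
print(unordered_sets)  # 591600030
```

Because the visiting order matters, each of the roughly 0.59 billion POI sets corresponds to 5! = 120 distinct candidate trips, giving about 71 billion ordered trips in total.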

We presented a comprehensive review of related work on the trip recommendation problem
in Section 2.2. Here we specifically compare with [34]. Similar to ours, [34] formulated the trip

recommendation as a constrained objective function optimization problem. However, it assumes

that POIs can be grouped into several types or categories and the user knows the order of visiting

POI types and likes to visit POIs of each type exactly once in a pre-determined order. The restriction

significantly reduces the search space. In real world applications, however, the user may not provide

this order either because she does not care about the order or because she is concerned that such a

fixed order may restrict her options. In addition, their work does not consider the POI availability

and the uncertainty of traveling time.

Contributions

This work makes the following contributions.

• We address the trip recommendation by taking into account the following information and

constraints: (1) the user’s personalized preferences on POIs; (2) the user’s time budget that

constrains the total traveling and visiting time; (3) the time window for the POI availability;

(4) the uncertainty of traveling time between POIs; (5) the diversity of POIs that constrains

the minimum number of POI categories. We formulate the above requirements in our TripRec problem, which finds an optimal trip that maximizes user happiness under the constraints that all the POIs in the trip can be visited, that the trip can be completed within the user’s time budget with a probability no less than a user-specified threshold, and that the trip covers a user-specified minimum number of POI categories. (Section 4.2)

• We present the personalized rating estimation for POIs by applying collaborative filtering to

items with features, and the modeling of POI availability, uncertain traveling time and diver-

sity of POIs. These are the key factors that distinguish our modeling of trip recommendation

from previous ones. (Section 4.3)

• When the user’s time budget constraint and start/destination locations are provided, we search

for the optimal trip route under the various constraints discussed above. We present two optimal solutions that are guaranteed to find the optimal trip if it exists. One is based on a state expansion approach and the other on a prefix-based depth-first search strategy; both incorporate the constraints and dominance-based pruning to reduce the search space. (Sections 4.4 and 4.5)

• We also present two heuristic solutions that find “good trips” with a significantly better run-

time than the optimal solutions. (Section 4.6)


• We evaluated all solutions on two real life LBSN data sets, Yelp and Foursquare, and demon-

strated the superiority over previous trip recommendation algorithms. (Section 4.7)

4.2 Problem Statements

In this section, we describe the preliminary concepts and formulate the problem studied in this

work. We begin by summarizing the main notations and their corresponding interpretations to be

used throughout this chapter in Table 4.1 for easy reference.

Table 4.1: Frequently used notations

| Notation | Gloss |
|---|---|
| $G = (V, E)$ | POI graph $G$ with POI set $V$ and edge set $E$ |
| $m_i$ | touring time for POI $i$ |
| $[O_i, C_i]$ | opening and closing time for POI $i$ |
| $t_{i,j}$ | traveling time from POI $i$ to $j$ |
| $r_{ui}$ or $r^*_{ui}$ | observed rating or estimated rating of user $u$ for POI $i$ |
| $\mathcal{P}$ | a trip route |
| $x, y$ | source and destination location of a trip |
| $T_0$ | departure time of a trip |
| $b$ | time budget of a trip |
| $F(\mathcal{P}, u)$ | score of a trip route $\mathcal{P}$ for user $u$ |
| $\psi(\mathcal{P})$ | completion probability of trip route $\mathcal{P}$ |
| $\theta$ | completion probability threshold |
| $\phi_i$ | category of POI $i$ |
| $\beta$ | POI diversity threshold |

POI graph: We assume that there are $n$ POIs in a directed complete graph $G = (V, E)$, where $V = \{1, \cdots, n\}$ is a set of POIs. Each POI $i \in V$ is associated with the following information: a touring time $m_i$, indicating the typical or average staying time for users, and opening hours $[O_i, C_i]$, indicating that $i$ opens at time $O_i$ and closes at time $C_i$. Each edge $e_{i,j} \in E$ represents the pre-computed shortest route from $i$ to $j$, where $i, j \in V$, and is associated with a traveling time $E[t_{i,j}]$, where $t_{i,j}$ follows a distribution with probability density function $f_{i,j}(\cdot)$ and $E[t_{i,j}]$ is its expectation. We assume that $E[t_{i,j}]$ and the functions $f_{i,j}(\cdot)$ are given and that traveling times for different routes $e_{i,j}$ are independent.

Rating matrix: We consider a set of users where a user $u$ may rate a POI $i$ after visiting $i$. A rating matrix $R$ contains all observed ratings $r_{ui}$. The rating matrix is usually extremely sparse, with most entries undefined, since a user may only rate a few POIs. Besides, a user $u$ could leave comments on POI $i$ when rating $i$, represented by a bag of words $B_{ui}$ (if a user $u$ does not rate a POI $i$, $B_{ui} = \emptyset$). The “content” of POI $i$ is defined as $B_i = \bigcup_u B_{ui}$. Based on $B_i$, we could derive the category of POI $i$, denoted as $\phi_i$. Based on the matrix $R$ and the comments, we could estimate a user $u$’s rating for an unvisited POI $j$, denoted as $r^*_{uj}$.

A trip route: For a specified source location $x$, a destination location $y$, and a departure time $T_0$, where $x$ and $y$ are not necessarily distinct, a trip route has the form $x \to \cdots i \cdots \to y$. It starts from $x$ at the time $T_0$, visits each POI $i$ listed in the route in order, and ends at $y$. We assume that the probability density functions $f_{x,i}(\cdot)$ and $f_{i,y}(\cdot)$ are known for any POI $i$, and we set $r^*_{ux} = r^*_{uy} = 0$, $m_x = m_y = 0$, $O_x = O_y = T_0$, $C_x = C_y = +\infty$. Such settings ensure that visiting $x$ and $y$ does not cost time because they serve only as the departure and destination locations for a trip. The score of a trip route $\mathcal{P}$ for a user $u$ is defined by an additive function

$$F(\mathcal{P}, u) = \sum_{i \in \mathcal{P}} r^*_{ui}. \tag{4.1}$$

This function simply sums up the estimated ratings $r^*_{ui}$ for all POIs in the route, which models the happiness of $u$ with respect to the route $\mathcal{P}$.

Constraints on a trip route: We have four types of constraints on a trip route as below:

• POI availability constraint: a user is considered to “visit” a POI $i$ only if the user spends $m_i$ time at $i$ during the opening hours $[O_i, C_i]$. Therefore, if a user arrives at $i$ before $O_i$, she has to wait until the opening hour, and the user should arrive at $i$ no later than $C_i - m_i$ to gain the happiness score. Merely “passing by” an intermediate POI $i$ on the way to the next POI does not count as visiting $i$, because the user does not spend $m_i$ time at $i$. For this reason, a trip route visits each POI at most once, and POIs that are just passed by are not included in a trip route.

• Time budget constraint: the whole trip is completed within a period of time b, including trav-

eling time E[ti,j ] between POIs and touring time mi at POIs.

• Completion probability constraint: the probability that a trip finishes at the destination y by

the time T0 + b is not less than a user specified threshold θ ∈ [0, 1).

• POI diversity constraint: an integer threshold $\beta$ specifies that the user wishes to visit at least $\beta$ different categories of POIs in a trip for touring diversity. Note that the threshold $\beta$ controls the minimum number of POI categories, not the minimum number of POIs.
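To make the score of Eqn. (4.1) and the POI diversity constraint concrete, the following Python code evaluates both on a toy route. It is a minimal sketch: the function names `trip_score` and `satisfies_diversity` are hypothetical, the POI labels follow Figure 4.1, and the estimated ratings are made-up values. The source and destination are omitted from the route lists because their estimated ratings are set to 0.

```python
def trip_score(route, est_rating):
    """F(P, u): sum of the estimated ratings of the POIs on the route."""
    return sum(est_rating[i] for i in route)


def satisfies_diversity(route, category, beta):
    """True if the route covers at least beta distinct POI categories."""
    return len({category[i] for i in route}) >= beta


# Categories as in Figure 4.1; the ratings are hypothetical.
category = {"A": "park", "B": "restaurant", "C": "park", "D": "shop"}
est_rating = {"A": 4.8, "B": 4.6, "C": 4.3, "D": 3.9}

route_parks = ["A", "C"]   # two parks: higher score, a single category
route_mixed = ["A", "D"]   # park + shop: lower score, two categories

for route in (route_parks, route_mixed):
    print(route,
          round(trip_score(route, est_rating), 2),
          satisfies_diversity(route, category, beta=2))
```

With $\beta = 2$, the higher-scoring route through the two parks is rejected, while the park-and-shop route is accepted.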

4.2.1 Personalized Trip Recommendation Problem

Problem 2. [TripRec] Given a POI graph $G = (V, E)$ with the associated touring time $m_i$, opening hours $[O_i, C_i]$, and bag of words $B_i$ for each POI $i \in V$, the probability density function $f_{i,j}(\cdot)$ for the traveling time on each edge $e_{i,j} \in E$, a rating matrix $R$, a user $u$ with the source $x$ and the destination $y$, a departure time $T_0$, a time budget $b$, a completion probability threshold $\theta \in [0, 1)$, and a diversity constraint $\beta$, we want to find an optimal trip route $\mathcal{P}$ that maximizes user happiness $F(\mathcal{P}, u)$ under the following constraints: (1) it starts at location $x$ and ends at location $y$; (2) it satisfies the POI availability constraint; (3) it completes within the time budget; (4) it satisfies the completion probability constraint; (5) it satisfies the POI diversity constraint.

Theorem 4. The TripRec problem is NP-hard.

Proof. We first introduce a well-known NP-hard problem, the Orienteering Problem [16], which is defined as follows: given a set of vertices, each with a score, the goal is to determine a path, limited in length (budget), that visits some vertices and maximizes the sum of the collected scores. The Orienteering Problem is a special case of the TripRec problem, obtained by ignoring the touring time, the POI availability constraint, the uncertain traveling time, the completion probability constraint, and the POI diversity constraint. In particular, we assume that $T_0 = 0$, $m_i = 0$, $[O_i, C_i] = [0, +\infty)$, the traveling time $t_{i,j}$ is a fixed constant, the completion probability threshold $\theta = 0$, and the diversity threshold $\beta = 1$. Under these assumptions, the TripRec problem is identical to the Orienteering Problem. Therefore, the TripRec problem is NP-hard.

Note that although the problem is NP-hard and the problem setting is complicated by the real-world constraints on POI availability, POI diversity, and completion probability, these constraints also give us the opportunity to leverage them for pruning and thereby greatly reduce the search space. This is an important reason why we still pursue efficient optimal solutions for the TripRec problem.

In the following sections, we first model the user’s personalized preferences and the trip con-

straints; then, we present several approaches to search the optimal trip route according to the esti-

mated preferences for TripRec.

4.3 Modeling Preferences and Constraints

In this section, we discuss our modeling of user preference and trip constraints.

4.3.1 Estimating User Preferences

Most existing POI recommendation methods either consider no content information of POIs or

treat content information as side information. We believe that content information of POIs should

play a more central role in user preference in that a user likes a POI because of certain features of

the POI. To this end, we adopt the feature-centric collaborative filtering proposed in [124]. Unlike

the traditional collaborative filtering on POIs, this approach performs collaborative filtering on the

features of POIs and determines the rating on a POI using the predicted ratings on the features of

the POI.

First, we transform the original user-POI rating matrix $R$ into a user-feature matrix $R'$, where each row represents a user $u$ and each column represents a feature $f$ in $\bigcup_i B_i$ for POIs $i$. We assume that the user may select some features of the POI when she rates it. If the user rates the POI $j$ but does not specifically select any feature of $j$, it is assumed that all features of $j$ are selected by default. An entry $(u, f)$ in $R'$ stores the aggregated rating on the feature $f$ over the POIs $i$ such that $B_i$ contains $f$ and $i$ is rated by $u$:

$$g_{uf} = \mathrm{agg}(\{\, r_{ui} \mid f \in B_{ui} \text{ and } r_{ui} \text{ is defined} \,\}). \tag{4.2}$$

In this work, $\mathrm{agg}(X)$ returns the average of the values in $X$, but other aggregation operations are possible. $\mathrm{agg}(X)$ is undefined if $X$ is empty.
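The transformation from POI ratings to feature ratings in Eqn. (4.2) can be sketched as follows, with agg taken to be the average. This is a minimal illustration and not the implementation of [124]: the function name `build_user_feature_ratings`, the dictionary-based data layout, and the toy ratings are assumptions made for the example.

```python
from collections import defaultdict


def build_user_feature_ratings(ratings, selected_features):
    """Average, per (user, feature), the ratings of the POIs carrying that feature.

    ratings           : {(user, poi): rating}, the observed entries r_ui
    selected_features : {(user, poi): set of features}, playing the role of B_ui
    Returns {(user, feature): g_uf}; pairs with no rating are simply absent,
    which mirrors agg(X) being undefined for an empty X.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (user, poi), rating in ratings.items():
        for feature in selected_features.get((user, poi), ()):
            sums[(user, feature)] += rating
            counts[(user, feature)] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Toy data (hypothetical users, POIs, and features).
ratings = {("u1", "cafe"): 4, ("u1", "museum"): 2, ("u2", "cafe"): 5}
selected_features = {
    ("u1", "cafe"): {"coffee", "quiet"},
    ("u1", "museum"): {"quiet", "art"},
    ("u2", "cafe"): {"coffee"},
}

g = build_user_feature_ratings(ratings, selected_features)
print(g[("u1", "quiet")])   # average of 4 and 2
print(("u2", "quiet") in g)  # no rating by u2 involves "quiet"
```

The resulting dictionary corresponds to the defined entries of $R'$, and the missing keys are the entries that matrix factorization later fills in.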


Then, we apply matrix factorization [48] to $R'$ to extract the latent user vectors $p_u$ for users $u$ and the latent feature vectors $q_f$ for features $f$. To predict the rating of user $u$ on a POI $i$, we aggregate the predicted ratings $p_u^T q_f$ over all the features $f$ in $B_i$:

$$r^*_{ui} = \mathrm{agg}(\{\, p_u^T q_f \mid f \in B_i \,\}). \tag{4.3}$$

We will use $r^*_{ui}$ as the estimated rating of a user $u$ on a POI $i$. Thus, if the user is estimated to rate most features of a POI highly, the user is estimated to rate the POI highly. Note that the estimation of user ratings is performed offline and only once as it applies to all users.
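Given latent vectors, the prediction of Eqn. (4.3) is an average of inner products. The Python code below is a minimal sketch that assumes the latent vectors have already been learned; the function name `predict_poi_rating` and the two-dimensional vectors are hypothetical values chosen so that the arithmetic is easy to follow.

```python
def predict_poi_rating(user_vec, feature_vecs, poi_features):
    """Average the predicted feature ratings p_u^T q_f over the features of a POI.

    user_vec     : latent vector p_u of the user
    feature_vecs : {feature: latent vector q_f}
    poi_features : features in B_i of the POI
    Returns None when the POI has no features (agg of an empty set is undefined).
    """
    scores = [
        sum(p * q for p, q in zip(user_vec, feature_vecs[f]))
        for f in poi_features
    ]
    if not scores:
        return None
    return sum(scores) / len(scores)


# Hypothetical latent vectors, as if produced by matrix factorization.
p_u = [1.0, 2.0]
q = {"coffee": [2.0, 1.0], "quiet": [0.0, 1.5]}

estimate = predict_poi_rating(p_u, q, ["coffee", "quiet"])
print(estimate)  # (4.0 + 3.0) / 2 = 3.5
```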

4.3.2 Modeling Time budget and POI Availability Constraints

We first assume that the traveling time $t_{i,j}$ between two POIs is deterministic. Note that there may be no direct edge between two POIs in the graph. To obtain every $t_{i,j}$ in advance, we use an existing shortest path algorithm, such as the Floyd-Warshall algorithm [30], to compute the pairwise traveling times in a preprocessing step. This is a one-time computation and the results are stored for later use. It handles general graphs, even if the triangle inequality does not hold on the graph.
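The preprocessing step can be sketched with the standard Floyd-Warshall recurrence. The Python code below is a minimal illustration: the function name `all_pairs_travel_time`, the edge dictionary layout, and the three-node example are assumptions made for this sketch.

```python
import math


def all_pairs_travel_time(n, edges):
    """Floyd-Warshall all-pairs shortest traveling times.

    n     : number of POIs, indexed 0 .. n-1
    edges : {(i, j): traveling time of the direct edge i -> j}
    Returns an n x n matrix; math.inf marks unreachable pairs.
    """
    dist = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for (i, j), time in edges.items():
        dist[i][j] = min(dist[i][j], time)
    for k in range(n):          # allow k as an intermediate stop
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist


# Toy directed graph: the direct edge 0 -> 2 is slower than going via 1,
# so the triangle inequality does not hold on the input.
edges = {(0, 1): 10, (1, 2): 5, (0, 2): 30, (2, 0): 8}
t = all_pairs_travel_time(3, edges)
print(t[0][2])  # 15, via POI 1
print(t[1][0])  # 13, via POI 2
```

The cubic running time is acceptable here because the computation is done once and reused by every query.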

The basic idea of trip planning is to extend the route $\mathcal{P}$ gradually. Suppose that $i$ is the last POI of $\mathcal{P}$, which satisfies the time budget and POI availability constraints, and $\pi_i$ is the starting time of visiting $i$. We may extend the route by adding a new POI $j$ after visiting $i$. We use the $Sat$ function to test whether the POI availability and the time budget constraints are satisfied after the extension. $Sat(i, j, \pi_i)$ returns true if $\pi_j + m_j \le C_j$ and $\pi_j + m_j + t_{j,y} \le T_0 + b$, where $\pi_j = \max\{\pi_i + m_i + t_{i,j},\ O_j\}$ is the starting time of visiting $j$. This test ensures that the user can get the full service at the POI $j$ and still reach the destination $y$ within the time budget.
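The deterministic feasibility test can be sketched as follows. This Python code is a minimal illustration under stated assumptions: the helper `start_time`, the argument layout of `sat`, and the toy data (times in minutes after midnight, one museum between source and destination) are hypothetical and are not taken from the experiments.

```python
import math


def start_time(i, j, pi_i, m, O, t):
    """pi_j: earliest time the visit of j can start after leaving i (waits for opening)."""
    return max(pi_i + m[i] + t[i][j], O[j])


def sat(i, j, pi_i, m, O, C, t, y, T0, b):
    """True if j can be fully visited before closing and y is still reachable in budget."""
    pi_j = start_time(i, j, pi_i, m, O, t)
    return pi_j + m[j] <= C[j] and pi_j + m[j] + t[j][y] <= T0 + b


# Toy data: index 0 = source x, 1 = a museum, 2 = destination y.
T0 = 540                            # depart at 09:00
m = [0, 90, 0]                      # touring times
O = [T0, 600, T0]                   # the museum opens at 10:00
C = [math.inf, 720, math.inf]       # and closes at 12:00
t = [[0, 30, 45],
     [30, 0, 20],
     [45, 20, 0]]                   # deterministic traveling times

print(start_time(0, 1, T0, m, O, t))                   # 600: arrive 09:30, wait until 10:00
print(sat(0, 1, T0, m, O, C, t, y=2, T0=T0, b=240))    # True: done 11:30, at y by 11:50
print(sat(0, 1, T0, m, O, C, t, y=2, T0=T0, b=150))    # False: budget ends at 11:30
```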

4.3.3 Modeling Uncertain Traveling Time

The above assumes that the traveling time $t_{i,j}$ for a sub-route $i \to j$ is deterministic. However, even if the traveling time can be estimated from historical data and external resources [81], the real traveling time remains uncertain because many factors can affect the traffic. To model this uncertainty, we treat the traveling time $t_{i,j}$ as a random variable following a certain distribution with probability density function $f_{i,j}(\cdot)$. Let $E[t_{i,j}]$ denote the expectation of $t_{i,j}$. In this case, the best one can guarantee is that the probability that a trip $\mathcal{P}$ can be finished within a given time budget $b$ is no less than some specified threshold $\theta$. This probability, called the completion probability, is denoted by $\psi(\mathcal{P})$, and we require $\psi(\mathcal{P}) \ge \theta$. Note that the preprocessing step of computing the shortest paths as in Section 4.3.2 is still required, with $E[t_{i,j}]$ as the weight of each edge $e_{i,j}$ of the graph.

We can modify the above constraint testing function $Sat$ for uncertain traveling time as follows. $Sat(i, j, \pi_i)$ returns true if $\pi_j + m_j \le C_j$, $\pi_j + m_j + E[t_{j,y}] \le T_0 + b$, and $\psi(\mathcal{P}) \ge \theta$, where $\pi_j = \max\{\pi_i + m_i + E[t_{i,j}],\ O_j\}$ is the expected starting time of visiting $j$. We emphasize that we use the constant $E[t_{i,j}]$ as the real traveling time from POI $i$ to POI $j$ in a trip, and the uncertainty of traveling time is only modeled in $\psi(\mathcal{P})$ for computing the completion probability of trip route $\mathcal{P}$ given budget $b$.

Let us derive $\psi(\mathcal{P})$ for a route $\mathcal{P}$. Consider a sub-route $i \to j$, for which we assume that the probability density function $f_{i,j}(\cdot)$ is known. For simplicity, we also assume that the $t_{i,j}$ are independent for different pairs $(i, j)$. Let $\chi$ denote the traveling time of the sub-route. The probability that the traveling time is less than $t$ is given as follows:

$$P(\chi < t) = \int_0^t f_{i,j}(\delta)\, d\delta. \tag{4.4}$$

Suppose that we extend $i \to j$ by a POI $k$. The probability that the traveling time of $i \to j \to k$ is less than $t$ is given by the multiple integral

$$P(\chi < t) = \iint_D f_{i,j}(\delta)\, f_{j,k}(\gamma)\, d\delta\, d\gamma, \tag{4.5}$$

where the domain $D = \{(\delta, \gamma) \in \mathbb{R}^2_{>0} : 0 < \delta + \gamma < t\}$ and $\mathbb{R}_{>0}$ denotes the positive real numbers. In general, for any route $\mathcal{P} : i \to j \cdots j' \to k$ with $c$ sub-routes, the probability that the total traveling time $\chi$ is less than $t$ is estimated by

$$P(\chi < t) = \int\!\!\int \cdots \int_D f_{i,j}(\delta_1) \cdots f_{j',k}(\delta_c)\, d\delta_1 \cdots d\delta_c, \tag{4.6}$$

where $D = \{(\delta_1, \cdots, \delta_c) \in \mathbb{R}^c_{>0} : 0 < \delta_1 + \cdots + \delta_c < t\}$. This probability can be computed given all the probability density functions $f_{i,j}, \cdots, f_{j',k}$.
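One simple way to evaluate the integral of Eqn. (4.6) numerically is to discretize each density on a grid and convolve the resulting probability masses. The Python code below is a minimal sketch of that idea, not the computation used in this work: the function names, the grid step, and the choice of a unit-rate exponential density for every leg are assumptions made purely for illustration.

```python
import math


def discretize(pdf, horizon, step):
    """Approximate a density on [0, horizon) by masses pdf(midpoint) * step per bin."""
    n_bins = int(round(horizon / step))
    return [pdf((n + 0.5) * step) * step for n in range(n_bins)]


def convolve(a, b):
    """Masses of the sum of two independent discretized variables, truncated at the horizon."""
    out = [0.0] * len(a)
    for i, mass_a in enumerate(a):
        for j, mass_b in enumerate(b):
            if i + j >= len(out):
                break  # the sum already exceeds the horizon
            out[i + j] += mass_a * mass_b
    return out


def travel_time_cdf(leg_pdfs, t, step=0.01):
    """Approximate P(chi < t) for the sum of independent leg traveling times."""
    total = None
    for pdf in leg_pdfs:
        leg = discretize(pdf, t, step)
        total = leg if total is None else convolve(total, leg)
    return sum(total)


def exp_pdf(x):
    """Unit-rate exponential density, used here only as an illustrative choice."""
    return math.exp(-x)


p1 = travel_time_cdf([exp_pdf], 3.0)        # route with one leg
p2 = travel_time_cdf([exp_pdf] * 2, 3.0)    # extended by a second leg
p3 = travel_time_cdf([exp_pdf] * 3, 3.0)    # extended by a third leg
print(round(p1, 3), round(p2, 3), round(p3, 3))
```

For two independent unit-rate exponential legs the exact value is $1 - e^{-t}(1 + t)$, about 0.801 at $t = 3$, which the grid approximation reproduces up to a discretization error proportional to the step. The three values are non-increasing as legs are added.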

Theorem 5. [Completion probability based pruning] Let χ be the total traveling time of a route

P : i → j · · · → j′ and let χ′ be the total traveling time of another route P ′ : i → j · · · j′ → k

obtained by adding a new POI k to the route P. Assume that both routes start at i at the same time. Then P(χ′ < t) ≤ P(χ < t).

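Proof. We first consider a simple case of route P = i → j and the extension P′ = i → j → k. Due to the independence of traveling time at different sub-routes, $P(\chi' < t) = \iint_D f_{i,j}(\delta) f_{j,k}(\gamma)\, d\delta\, d\gamma$, so

$$P(\chi' < t) = \int_0^t f_{i,j}(\delta) \int_0^{t-\delta} f_{j,k}(\gamma)\, d\gamma\, d\delta \le \int_0^t f_{i,j}(\delta) \cdot 1\, d\delta = P(\chi < t).$$

Similarly, for the general case P = i → j · · · j′−1 → j′ and the extension P′ = i → j · · · j′ → k, according to Eqn. (4.6),

$$P(\chi' < t) = \int\!\cdots\!\int_D f_{i,j}(\delta_1) \cdots f_{j',k}(\delta_c)\, d\delta_1 \cdots d\delta_c \le \int\!\cdots\!\int_{D^-} f_{i,j}(\delta_1) \cdots f_{j'-1,j'}(\delta_{c-1}) \cdot 1\, d\delta_1 \cdots d\delta_{c-1} = P(\chi < t),$$

where $D^- = \{(\delta_1, \ldots, \delta_{c-1}) \in \mathbb{R}^{c-1}_{>0} : 0 < \delta_1 + \cdots + \delta_{c-1} < t\}$.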

In other words, the probability of finishing a route within the time budget is never increased by

extending the route with one more POI at the end. This is because adding a POI at the end of a route

does not affect the traveling time between the previous POIs of the route, but reduces the chance of

completing the route within the time budget due to the additional time of traveling to and visiting

the new POI. We shall use this property to prune the trips that have their completion probability

below the specified threshold θ.
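The monotonicity stated in Theorem 5 can be illustrated numerically. The following is a minimal Monte Carlo sketch, not part of the proposed method: the function names, the log-normal parameters and the threshold t = 2.5 are assumptions made only for illustration. Because the same sampled sub-route times are reused for the route and its extension, the estimate for the extended route can never exceed the estimate for the original route.

```python
import random


def sample_route_times(params, n_samples, seed=0):
    """Draw n_samples vectors of independent log-normal sub-route times.

    params is a list of (mu, sigma) pairs, one per sub-route
    (illustrative values, not taken from the thesis).
    """
    rng = random.Random(seed)
    return [[rng.lognormvariate(mu, sigma) for mu, sigma in params]
            for _ in range(n_samples)]


def prob_within(samples, t, n_legs):
    """Monte Carlo estimate of P(total time of the first n_legs sub-routes < t)."""
    hits = sum(1 for legs in samples if sum(legs[:n_legs]) < t)
    return hits / len(samples)


# Route i -> j -> k has two sub-routes; its prefix i -> j uses only the first.
params = [(0.0, 0.5), (0.2, 0.4)]
samples = sample_route_times(params, 20000)
p_short = prob_within(samples, 2.5, 1)  # route i -> j
p_long = prob_within(samples, 2.5, 2)   # extended route i -> j -> k
print(p_short, p_long)
```

Every sample in which the extended route finishes before t is also a sample in which the shorter route does, which mirrors the argument in the proof.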

Various studies and methods have been proposed to estimate travel time distributions in the liter-

ature, such as log-normal distribution, gamma distribution, mixture of normal distributions, burr-XII

distribution with the consideration of the levels of services (defined with respect to road capacity

and critical density), etc. While each of these distributions has its own pros and cons and its perfor-

mance relies on the specific application, the comparison of them is not the focus of our work. See

[39] for a detailed comparison. Our method does not depend on the choices of such distribution,

provided that the probability P (χ < t) can be computed for a route P . For concreteness, we adopt

the log-normal distribution in [106]. The traveling time $t_z$ on the zth sub-route in a route independently follows the log-normal distribution with parameters $\mu_z, \sigma_z$, that is, $t_z \sim \mathcal{LN}(\mu_z, \sigma_z^2)$. The expected traveling time $E[t_z]$ is given by $\exp(\mu_z + \sigma_z^2/2)$. The total traveling time χ of a route P made of multiple sub-routes is the sum of the traveling time $t_z$ of each sub-route, i.e., $\chi = \sum_z t_z$. According to [73], χ can be approximated by another log-normal distribution $\mathcal{LN}(\mu_\chi, \sigma_\chi^2)$ with the following parameters:

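$$\sigma_\chi^2 = \log\left(\frac{\sum_z e^{2\mu_z + \sigma_z^2}\,(e^{\sigma_z^2} - 1)}{\left(\sum_z e^{\mu_z + \sigma_z^2/2}\right)^2} + 1\right), \qquad \mu_\chi = \log\left(\sum_z e^{\mu_z + \sigma_z^2/2}\right) - \frac{\sigma_\chi^2}{2}. \tag{4.7}$$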

With this distribution for the total traveling time χ, the probability of completing a trip P is ψ(P) = P(χ < t), where t is the time available for traveling, that is, $t = b - \sum_j m_j$. For the log-normal distribution,

$$P(\chi < t) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{\ln t - \mu_\chi}{\sigma_\chi \sqrt{2}}\right)\right], \tag{4.8}$$

where $\operatorname{erf}(t) = \frac{2}{\sqrt{\pi}} \int_0^t e^{-\delta^2}\, d\delta$ [106].

Note that the estimation of the probability of completing a trip, P(χ < t), is not costly. Because we assume that the parameters µz and σz of the distribution for tz are given, estimating the parameters µχ and σχ of the distribution for χ in Eqn. (4.7) only requires two sums over the sub-routes of the route. Then, with µχ and σχ, evaluating Eqn. (4.8) takes constant time.
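To make the computation in Eqns. (4.7) and (4.8) concrete, the following is a minimal sketch using only the Python standard library. The function names `fenton_wilkinson` and `completion_probability`, and the handling of the degenerate cases t ≤ 0 and σχ = 0, are assumptions for illustration and are not taken from the original implementation.

```python
import math


def fenton_wilkinson(params):
    """Approximate a sum of independent log-normals by one log-normal (Eqn. 4.7).

    params: list of (mu_z, sigma_z), one pair per sub-route.
    Returns (mu_chi, sigma_chi).
    """
    mean = sum(math.exp(mu + sigma ** 2 / 2) for mu, sigma in params)
    var = sum(math.exp(2 * mu + sigma ** 2) * (math.exp(sigma ** 2) - 1)
              for mu, sigma in params)
    sigma2_chi = math.log(var / mean ** 2 + 1)
    mu_chi = math.log(mean) - sigma2_chi / 2
    return mu_chi, math.sqrt(sigma2_chi)


def completion_probability(params, budget, visit_times):
    """psi(P) = P(chi < t) with t = b - sum of visiting times m_j (Eqn. 4.8)."""
    t = budget - sum(visit_times)
    if t <= 0:
        return 0.0  # assumed convention: no time left for traveling
    mu_chi, sigma_chi = fenton_wilkinson(params)
    if sigma_chi == 0:
        return 1.0 if math.log(t) > mu_chi else 0.0  # deterministic travel time
    z = (math.log(t) - mu_chi) / (sigma_chi * math.sqrt(2))
    return 0.5 * (1 + math.erf(z))


# Illustrative trip with two sub-routes and two one-hour visits.
psi = completion_probability([(0.0, 0.3), (0.2, 0.4)], budget=6.0,
                             visit_times=[1.0, 1.0])
print(psi)
```

For a route with a single sub-route, the approximation returns the original parameters (µz, σz) unchanged, which gives a quick sanity check of the formulas.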

4.3.4 Modeling POI Categories

In some scenarios, each POI has an explicit category. For example, some tourism websites categorize POIs into different classes, and we can simply use this category information. In most cases, however, explicit categories are not available, so we need to infer the category information from the content description of each POI. Let us first discuss how the POI categories are inferred.

We classify the POI categories by LDA [9], a basic topic modeling method. The intuition behind LDA is that documents exhibit multiple topics, which are represented by distributions over words. Since each POI i is associated with a bag of words Wi, we can directly adopt LDA to infer the topic distribution θi for each POI i. Refer to [9] for more details.

θi is a probabilistic mixture of latent topics and each dimension θi(z) represents the probability for a certain topic z. Then the category of POI i, denoted as φi, is defined as follows:

$$\phi_i = \arg\max_z \theta_i(z). \tag{4.9}$$

The POI diversity constraint requires that a feasible route P consists of at least β categories of POIs, namely, $\left|\bigcup_{i \in \mathcal{P}} \{\phi_i\}\right| \ge \beta$. When β = 1, every route always satisfies the diversity constraint.
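The category assignment in Eqn. (4.9) and the diversity test can be sketched as follows. The topic distributions are assumed to be already inferred (for example by LDA); the POI names, the three-topic vectors and the function names are made up for illustration.

```python
def poi_category(theta):
    """Return the topic index z that maximizes theta[z] (Eqn. 4.9).

    Ties are resolved in favor of the smallest index, an assumed convention.
    """
    return max(range(len(theta)), key=lambda z: theta[z])


def satisfies_diversity(route, topic_dist, beta):
    """True if the POIs on the route cover at least beta distinct categories."""
    categories = {poi_category(topic_dist[i]) for i in route}
    return len(categories) >= beta


# Hypothetical topic distributions theta_i over three latent topics.
topic_dist = {
    "museum": [0.7, 0.2, 0.1],
    "gallery": [0.6, 0.3, 0.1],
    "park": [0.1, 0.1, 0.8],
}
print(satisfies_diversity(["museum", "gallery"], topic_dist, beta=2))
print(satisfies_diversity(["museum", "park"], topic_dist, beta=2))
```

In this toy input, the museum and the gallery fall into the same category, so only the second route covers two categories.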

4.4 Optimal Method: State Expansion

As Problem 2 is NP-hard, there is no polynomial time exact algorithm for it unless NP = P. Thus, a

brute-force search of all candidate trips is prohibitive. However, most of these candidate trips violate the POI availability constraint, do not match the user's preferences, or cannot be finished within the given time limit. That is, the complexity of this problem in practice is largely determined by the user's preferences and the constraints. Thus, an efficient search algorithm with a strategy that prunes the infeasible and non-optimal trips based on user preferences, POI availability and traveling time uncertainty is essential for scaling a solution to large applications.

In this section, we present a state expansion algorithm that guarantees to find an optimal route

if it exists. The idea is to consider each partially generated route as a state associated with some

ending POI i, representing a trip route x → · · · → i → y that has i as the ending POI before

reaching the specified destination y. Each state is labeled by s = (K,H,Z, T,P, i), where K is the

set of POIs already visited, excluding x and y, H is the overall happiness collected (i.e., F (K,u)),

Z is the set of categories covered by the POIs in K, T is the starting time of visiting at i (i.e., πi),

P is the current route x → · · · → i (without the sub-route i → y) and i is the ending POI. These

parameters are denoted as sK, sH, sZ, sT, sP and si, respectively. Initially, there is only one state s0 = (∅, 0, ∅, T0, x, x), representing the trip route x → y.

At the κth iteration (κ > 0), the state expansion algorithm extends each state of size κ− 1 into

a new state of size κ by adding a new POI. Specifically, a state s = (K,H,Z, T,P, i) is extended

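into a new state s′ associated with POI $j \notin s_K \cup \{x, y\}$ according to the following rules:

$$\begin{aligned} s'_K &= s_K \cup \{j\} \\ s'_H &= s_H + r^*_{uj} \\ s'_Z &= s_Z \cup \{\phi_j\} \\ s'_T &= \max\{s_T + m_i + E[t_{i,j}],\ O_j\} \\ s'_P &= s_P \to j. \end{aligned} \tag{4.10}$$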

A new state s′ is feasible if Sat(i, j, sT ) returns true. Intuitively, this means that the partial route

of the state can be extended to j and then finished at the destination y within the time budget with

the completion probability no less than the threshold θ.

4.4.1 Dominance of States

It is possible that the same ending POI si could be reached by different states s of the same POIs

sK , corresponding to different visiting orders. Not all such states need to be maintained because

some do not lead to the optimal solution.

We say that a state s dominates a state s′ if

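$$(s_i = s'_i) \wedge (s_K = s'_K) \wedge (s_T \le s'_T) \wedge (\psi(\overline{s_P}) \ge \psi(\overline{s'_P})), \tag{4.11}$$

where $\overline{\,\cdot\,}$ forms the complete trip by adding y into the route, say $\overline{\mathcal{P}} = \mathcal{P} \to y$. Note that $s_K = s'_K$ implies $s_H = s'_H$ and $s_Z = s'_Z$, i.e., s and s′ give the same user happiness and set of categories.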

Intuitively, s dominates s′ if all of the following conditions hold: the two states s and s′ represent two routes P → y and P′ → y containing the same set of POIs and the same ending POI i, the starting visit time of i in s is no later than that in s′, and the completion probability of s is no less than that of s′. Note that dominance also applies in the case of a tie, i.e., when all the terms in Eqn. (4.11) hold with equality. Thus, in the case of a tie, the later extended state dominates an earlier extended one. We assume that the procedure Check tests the dominance: Check(s, s′) returns true if s dominates s′ (i.e., Eqn. (4.11)) and false otherwise.
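A state and the dominance test of Eqn. (4.11) can be sketched as below. The `State` class, its field names and the `psi` callable (which is assumed to return the completion probability of the complete trip obtained by appending y to a route) are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    visited: frozenset     # s_K: POIs already visited
    happiness: float       # s_H: overall happiness collected
    categories: frozenset  # s_Z: categories covered
    start_time: float      # s_T: starting time of the visit at the ending POI
    route: tuple           # s_P: current route x -> ... -> i
    end: str               # s_i: ending POI


def check(s, s2, psi):
    """Return True if s dominates s2 according to Eqn. (4.11)."""
    return (s.end == s2.end
            and s.visited == s2.visited
            and s.start_time <= s2.start_time
            and psi(s.route) >= psi(s2.route))


# Two visiting orders of the same POI set {A, B, C}, both ending at C.
s = State(frozenset("ABC"), 6.0, frozenset({0, 1}), 13.0, ("x", "A", "B", "C"), "C")
s2 = State(frozenset("ABC"), 6.0, frozenset({0, 1}), 14.0, ("x", "B", "A", "C"), "C")
constant_psi = lambda route: 0.9  # placeholder completion probability
print(check(s, s2, constant_psi), check(s2, s, constant_psi))
```

Here s reaches C one time unit earlier than s2 with the same completion probability, so s dominates s2 but not the other way round.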

Lemma 1. Suppose that a state s dominates a state s′, and let se and s′e denote the states obtained from extending s and s′ with a new POI j at the end, respectively. Then se dominates s′e.

Proof. Suppose that s and s′ represent the routes P → y and P ′ → y. Then se and s′e represent the

routes P → j → y and P ′ → j → y. It is easy to see that the first three conditions in Eqn. (4.11)

remain true for se and s′e. To see the last condition, since s dominates s′, both P and P ′ have the

same ending POI i. If we regard i as the new source of the following identical trip i → j → y for

both se and s′e, the completion probability of this trip in se is no less than that in s′e because it starts

earlier in se. Combined with the previous trip P and P ′, this condition still holds.

By repeatedly applying Lemma 1, we have the next theorem.

Theorem 6. [Dominance based pruning] Assume that a state s dominates a state s′. If s′ can be extended into an optimal trip by a sequence of POIs, then so can s by the same sequence of POIs.

From the above theorem, it suffices to consider only non-dominated states. We will use this property to remove all dominated states without affecting optimality. Note that there may be more than one optimal solution.

4.4.2 Algorithm

Algorithm 2 summarizes the state expansion for TripRec. Starting at the initial state S = {s0}, the algorithm extends the current set of states, S, by adding one new POI at the end of a route in S. If the states in S have the size κ, the new states in S′ have the size κ + 1. The two for loops

extend each state in S with an unvisited POI and only feasible states are kept. Meanwhile, Lines 9-10 conduct the dominance test and remove dominated states. Lines 12-13 record the optimal route with the maximal user happiness, and the optimal route must fulfil the diversity threshold β.

The time complexity of Algorithm 2 is $O(n^2 2^n)$, which is exponential but much faster than the brute-force search $O(n!)$. However, due to the time budget and POI availability constraints, each trip typically consists of only a small fraction of all POIs. If the maximum number of POIs in a trip is τ, where $\tau \ll n$, $2^n$ is replaced with $\binom{n}{\tau}$ in the above complexity. The diversity constraint β is one of the necessary conditions used to check whether a trip is considered to be optimal. If β were not checked while the trips are constructed, we would need to keep every trip that meets the other constraints, rank these trips in descending order of overall happiness, and then, in an additional post-processing phase, check the diversity constraint for each kept trip from the top ranked to the bottom ranked until one trip meeting the constraint β is found. In this case, an unknown and potentially very large number of trips would have to be maintained while constructing the trips, which is a huge overhead.

Algorithm 2: State expansion
input : POI graph G, user u’s specific preferences r∗ui for each POI i, departure time T0, time budget b, diversity threshold β
output: optimal TripRec trip route, P∗

 1 s0 ← (∅, 0, ∅, T0, x, x), s∗ ← s0;
 2 S ← {s0}, S′ ← ∅;
 3 while S ≠ ∅ do
 4     for s ∈ S do
 5         for j ∈ V \ sK do
 6             if Sat(i, j, sT) then // i ≡ si
 7                 s′T ← max{sT + mi + E[ti,j], Oj};
 8                 s′ ← (sK ∪ {j}, sH + r∗uj, sZ ∪ {φj}, s′T, sP → j, j);
 9                 if ∃s′′ ∈ S′ : Check(s′, s′′) = true then
10                     remove s′′ from S′;
11                 add s′ to S′;
12                 if s′H > s∗H and |s′Z| ≥ β then
13                     s∗ ← s′;
14     S ← S′, S′ ← ∅;
15 return s∗P → y as P∗
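The following Python sketch mirrors the structure of Algorithm 2 on a toy instance. It is an illustration under assumptions rather than the original implementation: states are plain tuples (K, H, Z, T, P, i), the dictionaries `travel`, `visit`, `opens`, `closes`, `rating` and `category` are hypothetical inputs standing for E[t], m, O, C, r∗ and φ, and `psi` is an assumed callable returning the completion probability of a route extended by y.

```python
def state_expansion(pois, x, y, travel, visit, opens, closes, rating,
                    category, t0, budget, beta, psi, theta):
    """Level-wise state expansion with dominance pruning (sketch of Algorithm 2)."""
    def dominates(s, s2):
        # Eqn. (4.11): same ending POI and POI set, no later start, no lower psi.
        return (s[5] == s2[5] and s[0] == s2[0] and s[3] <= s2[3]
                and psi(s[4]) >= psi(s2[4]))

    s0 = (frozenset(), 0.0, frozenset(), t0, (x,), x)
    best, frontier = s0, [s0]
    while frontier:
        nxt = []
        for K, H, Z, T, P, i in frontier:
            for j in pois:
                if j in K:
                    continue
                Tj = max(T + visit[i] + travel[i][j], opens[j])
                route = P + (j,)
                # Sat: j closes too early, y unreachable in budget, or psi too low.
                if (Tj + visit[j] > closes[j]
                        or Tj + visit[j] + travel[j][y] > t0 + budget
                        or psi(route) < theta):
                    continue
                s_new = (K | {j}, H + rating[j], Z | {category[j]}, Tj, route, j)
                nxt = [s for s in nxt if not dominates(s_new, s)]
                nxt.append(s_new)
                if s_new[1] > best[1] and len(s_new[2]) >= beta:
                    best = s_new
        frontier = nxt
    return best[4] + (y,), best[1]


# Toy instance: unit travel times, one-hour visits, budget of 5 hours.
nodes = ["x", "A", "B", "C", "y"]
travel = {a: {b: 1.0 for b in nodes if b != a} for a in nodes}
visit = {"x": 0.0, "A": 1.0, "B": 1.0, "C": 1.0, "y": 0.0}
opens = {n: 0.0 for n in nodes}
closes = {n: 100.0 for n in nodes}
rating = {"A": 3.0, "B": 2.0, "C": 1.0}
category = {"A": 0, "B": 0, "C": 1}
route, happiness = state_expansion(
    ["A", "B", "C"], "x", "y", travel, visit, opens, closes, rating, category,
    t0=0.0, budget=5.0, beta=2, psi=lambda r: 1.0, theta=0.5)
print(route, happiness)
```

In this instance at most two POIs fit into the budget, and A and B share a category, so with β = 2 the best feasible trip visits A and C even though A and B together would give more happiness.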

4.5 Optimal Method: Prefix Based Depth-first Search

If the states of size κ are represented by the nodes at level κ in a tree structure (with the root at

level 0), Algorithm 2 generates the states in a breadth-first manner in that the states at level κ are

generated before any state at level κ + 1 is generated. For loose time budget and POI availability

[Figure 4.2: tree of compact states over V = {A, B, C, D}. Enumeration order: 1 {A}, 2 {B}, 3 {A,B}, 4 {C}, 5 {A,C}, 6 {B,C}, 7 {A,B,C}, 8 {D}, 9 {A,D}, 10 {B,D}, 11 {A,B,D}, 12 {C,D}, 13 {A,C,D}, 14 {B,C,D}, 15 {A,B,C,D}. Each node lists its routes grouped by ending POI, e.g., node 3 holds A→B and B→A, and node 7 holds AB→C, AC→B and BC→A.]

Figure 4.2: Prefix based depth-first compact state enumeration tree. The number indicates the order of enumeration.

constraints, this approach may have to keep many “open” states in memory (i.e., all states of the

same size), which imposes a bottleneck on the memory requirement. To address this limitation, we

present a prefix based depth-first search (PDFS) method against compact states, of which the idea

is shown below.

4.5.1 Prefix Based Depth-first Search

Compact states C. We observe that several fields in a state s, i.e., K (the set of POIs already visited), H (the overall happiness collected) and Z (the set of categories covered by the POIs in K), depend on the POI set of the current route P but are independent of how the POIs are ordered. Hence, we can group all the states (routes) sharing the same K as a compact state, denoted as C, and let $C_L$ denote the list of these routes having C as the POI set. Each such route P in $C_L$ has its own associated T and i because of the different order of POIs. Then, C is associated with the following fields:

$$\begin{aligned} C_H &: \text{the overall happiness collected for the routes grouped by } C \\ C_Z &: \text{the set of categories covered by the POIs in the routes grouped by } C \\ C_L &: \forall \mathcal{P} \in C_L,\ \mathcal{P} \text{ is associated with } T \text{ and } i. \end{aligned} \tag{4.12}$$

Storing the information of the original states as such compact states C saves a lot of memory. This information is cached in a hash map with C as the key. Next, we will introduce how to enumerate the compact states.

For presentation, we consider V = {A, B, C, D} of four POIs (each capital letter represents a POI), excluding the source x and destination y, with the POIs arranged in the lexicographical order of POI IDs. The compact states are enumerated as the subsets of V represented by the nodes of a tree. x is included in every compact state, so we omit x.

Figure 4.2 shows a compact state enumeration tree for V. Each node represents a compact state. We define the prefix of a POI i in a POI set as the set of POIs that precede i in the above order, e.g., the prefix of C is {A, B}. The compact states are generated in a specific prefix-first depth-first manner so that longer routes are extended from earlier computed shorter ones. Initially, the root is the empty set ∅. A child node C of the current node C− is generated by adding a POI i that precedes all POIs in C− to the front of C−, and all child nodes are arranged by the order of i. For example, Node 7 {A, B, C} is generated as a child node of Node 6 {B, C} by adding A to the front of {B, C} because A precedes B and C.

At node C, the routes in $C_L$ are generated by extending the cached routes in every compact state $C^{-j} = C \setminus \{j\}$, where j ∈ C. There are |C| such $C^{-j}$. We generate each route $\mathcal{P} = \mathcal{P}^- \to j$ by selecting a route $\mathcal{P}^-$ from $C^{-j}_L$ and appending j at the end, and compute the happiness and cost of P based on the information for $C^{-j}$ accessed from the hash map. P is kept in $C_L$ if it is feasible, namely, it satisfies the time budget and POI availability constraints, which can be tested by the procedure Sat.

For example, to generate the routes at the node {A, B, C}, we access the cached routes at nodes {A, B}, {A, C} and {B, C} and append the missing POI. AB → C represents all the routes ending with C and having the first two POIs in any order, i.e., x → A → B → C and x → B → A → C. Note that the enumeration materializes only the currently expanded branch of the tree, instead of the entire tree.

Then the complete route P → y for each P ∈ $C_L$ is checked to see whether it is feasible and is the best route found so far. If $C_L$ is empty, this compact state is not kept. If no compact state is expandable, we stop the enumeration and return the best route found.

So far, each $C_L$ can include up to |C|! routes. We now discuss how to embed the dominance based pruning introduced in Section 4.4.1 into the prefix based depth-first compact state enumeration. At the compact state C, when generating $\mathcal{P} = \mathcal{P}^- \to j$ for a given j, we actually only need to select the route $\mathcal{P}^-$ from $C^{-j}_L$ such that P is feasible and has the least cost T, and thus dominates all other routes $\mathcal{P}'^- \to j$ with $\mathcal{P}'^-$ from $C^{-j}_L$. This reduces |C|! routes to at most |C| dominating routes at the compact state C, one for each j in C.

For example, AB → C on node 7 {A, B, C} now represents only one route with the least cost chosen from A → B → C and B → A → C. Note that the dominance based pruning in this novel enumeration approach still performs as a subtree pruning, e.g., if A → B on node 3 is pruned, all the routes starting with A → B, such as A → B → C on node 7 and A → B → D on node 11, will never be considered.

Subset first property. An important property of the above prefix based depth-first enumeration is that a compact state C is always enumerated before any of its supersets. For example, the proper subsets of {A, B, C} are enumerated at Nodes 0-6 and {A, B, C} is enumerated at Node 7. This property ensures that, when computing the feasible routes at C with the ending POI j, the feasible sub-routes in $C^{-j}_L$ are guaranteed to have already been computed earlier.
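The enumeration order and the subset first property can be reproduced with a few lines of Python. This sketch only generates the order of the compact states and does not compute any routes; the function name `pdfs_order` and the tuple representation of a compact state are assumptions made for illustration.

```python
def pdfs_order(current, available, out):
    """Append compact states to out in prefix based depth-first order.

    current: POIs of the parent compact state, as a tuple in POI order.
    available: ordered POIs that precede every POI in current.
    """
    for idx, j in enumerate(available):
        child = (j,) + current  # add j to the front of the parent state
        out.append(child)
        # Only the prefix of j (the POIs before j) may extend the child.
        pdfs_order(child, available[:idx], out)
    return out


order = pdfs_order((), ("A", "B", "C", "D"), [])
for number, state in enumerate(order, start=1):
    print(number, "".join(state))
```

For V = {A, B, C, D}, the printed numbering agrees with the node numbers described for Figure 4.2, and every compact state appears after all of its non-empty proper subsets.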

This PDFS enumeration tree here is somewhat inspired by the subset-first depth-first (SFDF)

tree in our last work presented in Chapter 3, which is used to partition attributes to generate group

Algorithm 3: PDFS(C−, I) (Recursive function)
Global Input: POI graph G, user u’s specific preferences r∗ui for each POI i, time budget b, diversity threshold β
Parameters  : compact state C− and the set of POIs I
Output      : optimal TripRec trip route P∗ with its overall rating H∗

 1 for j ∈ I in order do
 2     C ← {j} ∪ C−;
 3     C_H ← C−_H + r∗uj;
 4     C_Z ← C−_Z ∪ {φj};
 5     for k ∈ C do
 6         C^{−k} ← C \ {k};
 7         P− ← the route in C^{−k}_L with ending POI i and T such that Sat(i, k, T) = true and P− → k is non-dominated;
 8         P ← P− → k;
 9         T′ ← max{T + mi + E[ti,k], Ok};
10         if Sat(k, y, T′) = true then
11             if C_H > H∗ and |C_Z| ≥ β then
12                 update P∗ and H∗;
13             add P to C_L;
14     PDFS(C, prefix of j in I);

relationships. In addition to the difference of applications, one major difference is that the PDFS tree is proposed to deal with trips consisting of POIs in different orders, whereas the SFDF tree deals with partitions of attributes, where the order does not matter. Besides, the ways of materializing each node and the pruning methods used are completely different.

4.5.2 Algorithm

We implement the above prefix depth-first search in Algorithm 3 as a recursive procedure PDFS(C−, I).

The current compact state C− and the POI set I available for extending C− are the parameters. The

inputs to the algorithm are the POI graph G, the departure time T0, the time budget b, the diversity

threshold β and user-specific preferences r∗ui. The output is the optimal route and its happiness,

stored in the global variables P∗ and H∗. The main algorithm is the call PDFS(∅,V) with the set

of POIs V in the POI graph G.

The algorithm extends the label C− by each available POI j in I, creating the child node with the label C ← {j} ∪ C− and computing the corresponding happiness and covered categories for C (Lines 1-4). Lines 5-13 add all non-dominated feasible routes having the POI set C to the route list $C_L$, and may update the currently found best route P∗ and its happiness H∗. In particular, for each k ∈ C, Line 7 searches for the non-dominated feasible route P for the POI set C and ending at k. If P is found, the time of visiting at k (the ending POI of P) is then computed. Then, if the complete route P → y is feasible, we add P to $C_L$ and check whether P can become the current best

route (Line 10-13). Finally, a recursive call of the algorithm is made to extend C with the POIs in the prefix of j in the current I.

Note that the algorithm does not materialize the entire enumeration tree; instead, it enumerates

the nodes in the tree in the prefix based depth-first order. The result at each node is stored in the

hash map.

4.6 Heuristic Methods

4.6.1 State Relaxing

We presented an optimal method with state expansion in Section 4.4. The dominance based pruning

applies only to two states that have exactly the same set of POIs, i.e., sK = s′K . If we are willing

to sacrifice optimality for efficiency, it is possible to have a more aggressive pruning by replacing

the condition sK = s′K with |sK| = |s′K| (i.e., visiting the same number of POIs), sH ≥ s′H and sZ ≥ s′Z (i.e., s representing a more preferred route than s′). So the dominance test condition in Eqn. (4.11) is relaxed into

$$(s_i = s'_i) \wedge (|s_K| = |s'_K|) \wedge (s_H \ge s'_H) \wedge (s_Z \ge s'_Z) \wedge (s_T \le s'_T) \wedge (\psi(\overline{s_P}) \ge \psi(\overline{s'_P})). \tag{4.13}$$

Intuitively, with this relaxed dominance relationship, the route for s takes no more time, generates at least as much happiness and covers at least as many POI categories as the route for s′, while reaching the same ending POI i. In other words, the route represented by s gives the user at least as much happiness, remaining time and diversity as the route represented by s′, and thus is preferred. We call the pruning based on this relaxed dominance relationship state relaxing. State relaxing applies to all states ending at the same POI after visiting the same number of POIs. This significantly reduces the size of the set of states S in Algorithm 2, as each ending POI may only be associated with a few states. So the time complexity is decreased from $O(n^2 2^n)$ to $O(cn^2)$ for some constant c.

However, due to the POI availability constraint, state relaxing loses the optimality in some cases.

For example, suppose A → D → C dominates A → B → C according to Eqn. (4.13) (we omit

the source x and destination y for simplicity), so the former is kept and the latter is eliminated. Now

suppose that B only opens in the morning and D opens until midnight. Then the route A → D → C → B may be infeasible due to the late visit to B while the route A → B → C → D could

be the optimal solution, but it cannot be generated because A → B → C was pruned. Section 4.7

will study experimentally the trade-off between efficiency and user happiness for the state relaxing

strategy.

4.6.2 Heuristic Insertion

In this section, we propose another simple heuristic algorithm that is essentially linear in the total

number of POIs while maintaining the quality of the route. The idea is intuitive: starting with the

initial trip route x → y, we insert one POI at a time between two adjacent POIs in the current trip

route so that (i) the insertion preserves the satisfaction of all the constraints and (ii) some score of

the route is maximized (to be discussed shortly). For example, inserting a POI A into x → y gives

the route x→ A→ y, then inserting a POI B before A gives x→ B → A→ y, and so on. Let us

first ignore the POI diversity constraint, since it is never violated by inserting a POI into the current

trip route. We will discuss whether this algorithm can deal with the diversity constraint later. The

procedure is illustrated in Algorithm 4. Each call of Insert adds one POI to the route, until it is impossible to add any new POI into the route without violating the constraints. To avoid a local optimum, we generate a small number of routes (say 2-3) by applying this method to the set of remaining POIs not contained in the previously generated routes, and we choose the best route from all the routes generated. The time complexity of this algorithm is O(cn), where n is the number of POIs and c is a constant, because each insertion considers at most n unvisited POIs. Note that the length of a route is usually small due to the time budget.

Algorithm 4: Heuristic Insertion
input : POI graph G, user u’s preferences r∗ui for each POI i, departure time T0, time budget b
output: TripRec trip route, P

 1 initialize the route P : x → y;
 2 repeat
 3     P′ ← Insert(P);
 4     P ← P′
 5 until no more POIs in V can be inserted;
 6 P1 ← P;
 7 remove the POIs in the route P1 from V;
 8 generate another route P2 by repeating Line 2-5;
 9 if F(P1, u) > F(P2, u) then
10     P ← P1
11 else
12     P ← P2

A remaining issue is to check whether inserting a POI k between two adjacent POIs i and j

(i.e., the sub-route i → j already exists in the trip) preserves the satisfaction of the time budget

constraint, the POI availability constraint and the completion probability constraint. We focus on

the POI availability constraint because it is easy to check the other two constraints. We assume

λij = 1, that is, a visit to POI i is followed by a visit to POI j. Before the insertion of k, the arrival time at POI j, denoted by aj, is computed by

$$a_j = \pi_i + m_i + E[t_{i,j}], \tag{4.14}$$

where πi is the starting time of visiting at POI i. The wait time at POI j, wj, is computed by

$$w_j = \max\{0,\ O_j - a_j\}. \tag{4.15}$$

The maximum allowed delay time at i to preserve the satisfaction of constraints, denoted by vi, is computed by

$$v_i = \min\{C_i - \pi_i - m_i,\ w_j + v_j\}, \tag{4.16}$$

where Ci − πi − mi is the maximum allowed delay time to keep the visit to i available (before it closes), and wj + vj is the maximum allowed delay time to keep the visit to j available.

For example, if πi = 10am, Ci = 2pm and mi = 1h, the maximum allowed delay time for i itself is 3h, i.e., the user can delay the arrival at i until 1pm at the latest. However, a delay at i may affect the visit to the next POI j. If wj + vj = 2h, that is, the visit to j can be delayed at most 2h, then the maximum allowed delay time at i is vi = min{3h, 2h} = 2h.

The insertion of k between i and j is possible only if the new route satisfies the probability

constraint according to Eqn. (4.8) and the extra time caused by the insertion does not exceed the

maximum allowed delay time at j, i.e., wj + vj . The extra time εk for inserting POI k is given by

$$\varepsilon_k = E[t_{i,k}] + w_k + m_k + E[t_{k,j}] - E[t_{i,j}]. \tag{4.17}$$

If εk ≤ wj + vj, k can be inserted between i and j and the insertion transforms i → j into i → k → j, thus, λik = λkj = 1, λij = 0.

To determine the score of the insertion of k, we calculate the ratio γk as follows:

$$\gamma_k = (r^*_{uk})^2 / \varepsilon_k. \tag{4.18}$$

This ratio measures the gain of happiness per unit of extra time of visiting k. The square of r∗uk places more emphasis on the rating. Since a smaller εk has less effect on the feasibility of the whole route, the POI k with a larger ratio γk is preferred. We try every adjacent (i, j) in the current route to find the best γk.

After each insertion, the arrival time, wait time, and maximum allowed delay time of all affected POIs in the route should be updated according to Eqns. (4.14)-(4.16). For example, if k is inserted to form a new route x → i1 → i2 · · · → k → j1 → j2 · · · → y, the arrival time, the wait time and the maximum allowed delay time of any POIs after k (j1, j2, · · ·) should be updated, and the maximum allowed delay time of any POIs before k (i1, i2, · · ·) should be updated. Moreover, the updates must follow the order imposed by the dependencies in Eqns. (4.14)-(4.16). For example, Eqn. (4.16) requires updating a later POI before updating an earlier POI in a route.
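The bookkeeping of Eqns. (4.14)-(4.18) can be sketched as a forward pass for arrival and wait times followed by a backward pass for the maximum allowed delays. The sketch below is an illustration under assumptions: the slack at the destination is taken to be the time left until T0 + b, the check that k itself is still open is added explicitly, the completion probability constraint is omitted, and all names and numbers are hypothetical.

```python
def schedule(route, t0, travel, visit, opens, closes, deadline):
    """Return arrival a, wait w, start pi and max delay v for each route position."""
    n = len(route)
    a, w, pi = [t0] * n, [0.0] * n, [t0] * n
    for p in range(1, n):
        i, j = route[p - 1], route[p]
        a[p] = pi[p - 1] + visit[i] + travel[i][j]  # Eqn. (4.14)
        w[p] = max(0.0, opens[j] - a[p])            # Eqn. (4.15)
        pi[p] = a[p] + w[p]
    v = [0.0] * n
    v[-1] = deadline - a[-1]        # assumed slack at the destination y
    for p in range(n - 2, -1, -1):  # Eqn. (4.16): later POIs are updated first
        i = route[p]
        v[p] = min(closes[i] - pi[p] - visit[i], w[p + 1] + v[p + 1])
    return a, w, pi, v


def insertion_score(route, pos, k, sched, travel, visit, opens, closes, rating):
    """Score gamma_k of inserting k after route[pos], or None if infeasible."""
    a, w, pi, v = sched
    i, j = route[pos], route[pos + 1]
    arrive_k = pi[pos] + visit[i] + travel[i][k]
    wait_k = max(0.0, opens[k] - arrive_k)
    if arrive_k + wait_k + visit[k] > closes[k]:
        return None  # k would close before the visit ends
    extra = travel[i][k] + wait_k + visit[k] + travel[k][j] - travel[i][j]  # (4.17)
    if extra > w[pos + 1] + v[pos + 1]:
        return None  # exceeds the maximum allowed delay at j
    return float("inf") if extra <= 0 else rating[k] ** 2 / extra  # (4.18)


nodes = ["x", "A", "B", "y"]
travel = {p: {q: 1.0 for q in nodes if q != p} for p in nodes}
visit = {"x": 0.0, "A": 1.0, "B": 1.0, "y": 0.0}
opens = {"x": 0.0, "A": 10.0, "B": 0.0, "y": 0.0}
closes = {"x": 24.0, "A": 14.0, "B": 24.0, "y": 24.0}
route = ["x", "A", "y"]
sched = schedule(route, 8.0, travel, visit, opens, closes, deadline=14.0)
score = insertion_score(route, 0, "B", sched, travel, visit, opens, closes, {"B": 4.0})
print(sched[3], score)
```

With A starting at 10am, closing at 2pm and taking one hour, the backward pass yields a maximum allowed delay of 2 hours at A, matching the worked example above, and inserting B before A costs 2 extra hours for a score of 8.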

We now discuss whether the heuristic insertion algorithm can deal with the diversity constraint. The answer is no. In each iteration, the algorithm chooses a POI to insert so as to maximize the ratio γk in Eqn. (4.18) while satisfying all the other constraints, but this ratio is not conditioned on the POI categories. In some cases, the POIs with the maximum value of γk in successive iterations all belong to the same category. As a result, the final recommended trip includes very few categories of POIs, possibly only one, and is likely to fail the diversity constraint.

4.7 Experimental Evaluation

This section presents the empirical evaluation of the proposed methods.

4.7.1 Experimental Setup

We adopt the Yelp¹ and Foursquare² data sets in our experiments. Both data sets were previously

used for recommendation evaluation in [41]. The Yelp data set contains 45,981 users, 229,906 rat-

ings of 1-5 scales, 11,537 POIs, plus text reviews on POIs. We preprocessed the reviews by remov-

ing stop words and infrequent words occurring in < 100 reviews, and using the remaining 8,519

keywords as the features. The feature set or content of a POI, Bi, consists of all keywords contained

in the reviews about the POI. The Foursquare data set contains 20,784 users, 153,477 binary 0/1 rat-

ings, 7,711 POIs, and user published tweets when checking-in at a POI. We obtained 1,377 features

after preprocessing the tweets.

For each POI i, the touring time mi is set to 1 hour, and the opening hours were generated from a Gaussian distribution, (Ci − Oi) ∼ N(µ, σ²) with the mean µ = 5 hours and the standard deviation σ = 1. The open time Oi was generated using a uniform distribution, Oi ∼ U(8, 12). We set the departure time T0 to 8am. The expected traveling time E[ti,j] for a pair of POIs (i, j) is estimated using Google Maps³ with the driving mode. We assume that Google Maps produces shortest paths between POIs; therefore, the preprocessing step of computing the shortest paths, as stated in Section 4.3.2, is unnecessary. All the experiments were run on a PC with a 2.53 GHz quad-core CPU and 12 GB of memory.

4.7.2 Rating Accuracy of Individual POIs

First, we evaluate the first step of our approach, that is, the accuracy of estimated ratings of POIs

produced by the feature-centric collaborative filtering. For both data sets, we keep 90% rating data

for training to conduct matrix factorization and use the remaining 10% rating data for testing the

accuracy of estimated ratings. As in the literature [84], we use the standard RMSE (root mean

squared error) and MAE (mean absolute error) as the accuracy metrics for POI recommendation.

These two metrics are defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{u,i} (r_{ui} - r^*_{ui})^2} \tag{4.19}$$

$$\mathrm{MAE} = \frac{1}{N} \sum_{u,i} |r_{ui} - r^*_{ui}|, \tag{4.20}$$

¹ http://www.yelp.com/dataset_challenge/

² https://foursquare.com/

³ https://www.google.com/maps

where rui is the true rating value, r∗ui is the predicted rating value, and N is the number of ratings

in the testing set. The smaller these values are, the better the result is.
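As a minimal sketch of Equations (4.19) and (4.20), the two metrics can be computed from paired lists of true and predicted test ratings. The function names and the sample ratings below are illustrative assumptions, not part of the original experiments.

```python
import math

def rmse(true_ratings, predicted_ratings):
    """Root mean squared error over paired test ratings, as in Eq. (4.19)."""
    n = len(true_ratings)
    squared = sum((r - p) ** 2 for r, p in zip(true_ratings, predicted_ratings))
    return math.sqrt(squared / n)

def mae(true_ratings, predicted_ratings):
    """Mean absolute error over paired test ratings, as in Eq. (4.20)."""
    n = len(true_ratings)
    return sum(abs(r - p) for r, p in zip(true_ratings, predicted_ratings)) / n

# Hypothetical test ratings (not taken from the Yelp or Foursquare data).
true_r = [5.0, 3.0, 4.0, 1.0]
pred_r = [4.0, 3.0, 2.0, 2.0]

print("RMSE:", rmse(true_r, pred_r))
print("MAE:", mae(true_r, pred_r))
```

Because RMSE squares each error before averaging, it penalizes a few large errors more heavily than MAE does, which is why both are usually reported together.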

Many POI recommendation approaches are based on topic modeling. For example, STM [41] and LCA [115] predict the probability of visiting a POI; error-based metrics such as RMSE and MAE cannot be computed for them because probabilities are not comparable with ratings. For this reason, we evaluate the following rating-based methods.

Probabilistic matrix factorization (denoted PMF): This is the classic matrix factorization on the user-item rating utility matrix [83], where POIs are treated as items. In PMF, matrix factorization is generalized as a probabilistic model with a latent user vector $p_u \sim \mathcal{N}(0, \alpha_p^{-1} I_D)$ and a latent item vector $q_i \sim \mathcal{N}(0, \alpha_q^{-1} I_D)$. The predicted rating of user $u$ on item $i$ is given by $r^*_{ui} = p_u^T q_i$. We adopt the default settings in [83] and set $D = 10$, the dimensionality of the user and item latent factors.

Collaborative topic regression (denoted CTR): This is the matrix factorization with topic modeling applied to the content of items, as described in [99]. For our data sets, the items are POIs and the content is the user reviews on the POIs. LDA is employed on the content of POI $i$ to learn the latent topic vector $\theta_i$, which is incorporated into the PMF framework to confine the search of latent item vectors by setting $q_i \sim \mathcal{N}(\theta_i, \alpha_q^{-1} I_D)$. We adopt the default settings in [99] and set $D = 10$.

Feature centric collaborative filtering (denoted FCF): This is the proposed algorithm in Sec-

tion 4.3.1. All the parameter settings are the same as in PMF.

Table 4.2 shows the accuracy of the above three methods. FCF achieves the best performance and has a significant improvement in terms of RMSE/MAE on both data sets. The improvement of FCF over a baseline on RMSE is measured by $(\mathrm{RMSE}(\text{baseline}) - \mathrm{RMSE}(\text{FCF})) / \mathrm{RMSE}(\text{baseline}) \times 100\%$; the improvement on MAE is computed in the same way. These results suggest that the ratings estimated by FCF are closer to the true ratings.

Table 4.2: RMSE and MAE. Lower values are better.

| Method | Yelp RMSE | Yelp MAE | Foursquare RMSE | Foursquare MAE |
|---|---|---|---|---|
| PMF | 1.3169 | 1.0491 | 0.6197 | 0.5160 |
| CTR | 1.2850 | 1.0277 | 0.6000 | 0.5018 |
| FCF | 1.2152 | 0.9720 | 0.5154 | 0.4402 |
| Improvement of FCF over PMF | 7.7% | 7.3% | 16.8% | 14.7% |
| Improvement of FCF over CTR | 5.4% | 5.4% | 14.1% | 12.2% |
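The relative improvement rows of Table 4.2 can be recomputed from the error values with the formula given above. The following is a small sketch; the dictionary layout and function name are illustrative choices, while the numbers are copied from Table 4.2.

```python
def improvement(baseline_error, fcf_error):
    """Relative improvement of FCF over a baseline, in percent."""
    return (baseline_error - fcf_error) / baseline_error * 100.0

# Error values copied from Table 4.2, in the order
# (Yelp RMSE, Yelp MAE, Foursquare RMSE, Foursquare MAE).
errors = {
    "PMF": (1.3169, 1.0491, 0.6197, 0.5160),
    "CTR": (1.2850, 1.0277, 0.6000, 0.5018),
    "FCF": (1.2152, 0.9720, 0.5154, 0.4402),
}

for baseline in ("PMF", "CTR"):
    gains = [improvement(b, f) for b, f in zip(errors[baseline], errors["FCF"])]
    print(baseline, [f"{g:.2f}%" for g in gains])
```

Recomputed from the rounded table entries, the values agree with the reported improvements to within rounding. The Foursquare MAE improvement over CTR comes out at roughly 12.3% rather than the reported 12.2%, which may reflect rounding of the underlying unrounded errors.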

t-Test. To further verify the statistical significance of the improvement introduced by FCF, we

conducted the paired t-Test (2-tail) on FCF and the two baselines. Table 4.3 shows that all p-values

in the t-Test results are less than 0.01, which suggests that the improvement of FCF over PMF and

CTR is statistically significant. In the rest of the experiments, we study the performance of trip

recommendation with the estimated rating r∗ui being generated by FCF.


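Table 4.3: Paired t-Test (2-tail) of FCF and baselines (p-values)

| Method | Yelp RMSE | Yelp MAE | Foursquare RMSE | Foursquare MAE |
|---|---|---|---|---|
| FCF/PMF | $7.4 \times 10^{-6}$ | $4.2 \times 10^{-6}$ | $6.7 \times 10^{-7}$ | $2.1 \times 10^{-6}$ |
| FCF/CTR | $1.1 \times 10^{-5}$ | $1.2 \times 10^{-5}$ | $2.0 \times 10^{-6}$ | $3.8 \times 10^{-6}$ |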

4.7.3 The Fixed Traveling Time Model Without Diversity Constraint

In this section, we evaluate the trip route P found by TripRec under the fixed traveling time model

without considering the diversity constraint, where the traveling time ti,j for a pair of POIs i and j is

fixed and the diversity threshold is set as a special case β = 1. The reason we use this deterministic

setting is that all the baselines consider fixed traveling time and none of them has a constraint on

the minimum number of categories covered by the trip. In this deterministic setting, a feasible route

always satisfies the completion probability constraint and the diversity constraint. The model for

uncertain traveling time will be considered in Section 4.7.4, and the effect of diversity constraint is

shown in Section 4.7.5.

We focus on three major cities for trip planning: Phoenix (PX) in Yelp, and New York City (NY) and Los Angeles (LA) in Foursquare. We choose Central City, Central Park, and Hollywood as both the source and the destination in these cities, respectively. For each city, we randomly pick 100 users from the testing data, and for each user, we select the top n = 150 unvisited POIs, ranked by their estimated ratings, for trip recommendation. This n is the number suggested in [6]. Even with this restriction, the number of trips that consist of 5 POIs can reach billions, which is infeasible for a brute-force search. We compare the following methods in terms of user happiness F (P, u) and runtime. All the methods take as input the personalized estimated ratings of each POI, learnt by FCF.
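To illustrate the size of the brute-force search space mentioned above, the following sketch counts the ordered sequences of 5 distinct POIs drawn from 150 candidates. Treating a trip as an ordered sequence without repetition is an assumption made here for illustration; feasibility constraints would reduce the count.

```python
import math

n_candidates = 150   # top-n unvisited POIs per user
trip_length = 5      # number of POIs on a trip

# Number of ordered selections of 5 distinct POIs out of 150.
ordered_trips = math.perm(n_candidates, trip_length)
print(f"{ordered_trips:,} candidate trips")
```

The count is on the order of tens of billions, which is consistent with the claim that exhaustive enumeration is impractical even for short trips.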

Greedy algorithm (denoted Greedy): This is the greedy algorithm from the operations research literature [97], which iteratively picks the POI $j$ with the highest ratio $r^*_j / t_{i,j}$, where $i$ is the location selected at the previous step. Note that we have added the POI availability constraint, which is not considered by [97].
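The greedy baseline can be sketched as follows. This is only an illustration under stated assumptions: the function name, the data layout, the rule that a visit must start and end within the opening hours (no waiting), and the check that the destination must remain reachable within the budget are all assumptions made here, not details taken from [97] or from the implementation evaluated in this chapter. Travel times are assumed to be strictly positive.

```python
def greedy_route(source, dest, ratings, travel, open_close, start, budget, stay=1.0):
    """Repeatedly append the feasible POI with the highest rating/travel-time ratio."""
    route = [source]
    current, clock = source, start
    deadline = start + budget
    unvisited = set(ratings) - {source, dest}
    while True:
        best, best_ratio = None, 0.0
        for j in unvisited:
            arrive = clock + travel[current][j]
            leave = arrive + stay
            opens, closes = open_close[j]
            # Assumed availability rule: the whole visit lies within opening hours.
            if arrive < opens or leave > closes:
                continue
            # The destination must still be reachable within the time budget.
            if leave + travel[j][dest] > deadline:
                continue
            ratio = ratings[j] / travel[current][j]
            if ratio > best_ratio:
                best, best_ratio = j, ratio
        if best is None:
            break
        clock += travel[current][best] + stay
        route.append(best)
        unvisited.remove(best)
        current = best
    route.append(dest)
    return route

# Hypothetical toy instance: times in hours, "s" is both source and destination.
travel = {
    "s": {"a": 0.5, "b": 1.0, "c": 0.5},
    "a": {"s": 0.5, "b": 0.5, "c": 1.0},
    "b": {"s": 1.0, "a": 0.5, "c": 0.5},
    "c": {"s": 0.5, "a": 1.0, "b": 0.5},
}
ratings = {"a": 4.0, "b": 5.0, "c": 3.0}
open_close = {"a": (8, 13), "b": (9, 14), "c": (8, 9)}

route = greedy_route("s", "s", ratings, travel, open_close, start=8, budget=5)
print(route)
```

Because the search commits to one POI at a time and never backtracks, it keeps a single partial route, which is the reason given later in this section for its stable runtime and its lower happiness.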

Dynamic programming (denoted DP): This is the dynamic programming approach proposed in [34]. We adapt to its order constraint by assigning a single “global” category to every POI and fixing the visiting order as “global” category to “global” category. However, the dynamic programming in [34], which fills up a 2-dimensional array, still cannot deal with the POI availability constraint.

Heuristic Insertion (denoted HI): This is the heuristic algorithm proposed in Section 4.6.2. HI

is designed for fast search and does not guarantee the optimality of solution.

State expansion (denoted SE): This is Algorithm 2 proposed in Section 4.4. Let SE-SR denote

SE with state relaxing, presented in Section 4.6.1. SE guarantees the optimality of solution, but

SE-SR does not.

Prefix based depth-first search (denoted PDFS): This is Algorithm 3 in Section 4.5 that uses

the prefix based depth-first enumeration of POIs. PDFS guarantees the optimality of solution.


[Figure 4.3 plots omitted. Panels: (a) PX, (b) NY, (c) LA. Left plots: happiness vs. time budget (5h to 9h). Right plots: runtime in seconds (logarithmic scale) vs. time budget (5h to 9h). Legend: Greedy, DP, HI, SE-SR, PDFS, SE.]

Figure 4.3: The fixed traveling time model: (left) happiness of trip routes found (y-axis) vs. time budget (x-axis); (right) average runtime (y-axis) vs. time budget (x-axis).

User happiness

Let us recall that the happiness of user $u$ with respect to route $P$ is defined as $F(P, u) = \sum_{i \in P} r^*_{ui}$,

which sums up the estimated ratings r∗ui for all POIs in route P . Figure 4.3 (left column) presents

the user happiness score of the trips found by all methods, with y-axis being the happiness score

averaged over all testing users and x-axis being the time budget b of a trip (hours). Note that SE and PDFS generate trips with exactly the same happiness score because both are optimal.

Overall, the number of POIs in the recommended route varies from 3 to 7 depending on the

setting of the time budget b. As the time budget increases, the happiness of users generally increases.

PDFS and SE perform best since they guarantee the global optimum. Interestingly, SE-SR yields a nearly optimal solution: its happiness is only slightly (< 1%) lower than that of the optimal PDFS/SE. SE-SR stays close to the optimal solutions because we select the top 150 POIs for each user, and the ratings of these 150 POIs likely differ only slightly; therefore, in many cases, trips that consist of the same number of POIs and share the same ending POI have similar happiness.

Table 4.4: Pruning power of the SE/PDFS algorithms under various time budget settings. The numbers are the percentages of states pruned by the algorithms.

| b | PX | NY | LA |
|---|---|---|---|
| 5h | 99.30% | 99.29% | 99.25% |
| 6h | 99.23% | 99.26% | 99.20% |
| 7h | 99.20% | 99.22% | 99.17% |
| 8h | 99.18% | 99.21% | 99.15% |
| 9h | 99.16% | 99.19% | 99.11% |

HI ranks third, and there is an obvious gap between HI and the best two. This is because HI maintains only one route during search, which makes it easy to fall into a local optimum. We will further explain this in the case study below. Greedy performs about 10% worse than HI, as its search strategy is rather simple. DP performs poorly on all the testing cities because it cannot deal with the POI availability constraint. In fact, only partial happiness is gained for routes in which some of the POIs are already closed when the user arrives, leading to low happiness scores for many users. Moreover, DP cannot guarantee a better result for a larger b.

Runtime

Figure 4.3 (right column) presents the average runtime per user, with y-axis being the runtime (seconds) and x-axis being the time budget b of a trip (hours). HI and Greedy have a fast and stable runtime because both maintain only one route; however, this also means that they overlook other possible combinations of POIs and thus rarely find optimal solutions. SE runs out of memory when the time budget is over 7 hours because too many “open” states are kept in each iteration. PDFS avoids this problem through the prefix based depth-first enumeration. Although it takes the longest time at b = 5h, PDFS finds the optimal solution without the steep increase in runtime encountered by SE. SE-SR takes substantially less time than SE by trading optimality for efficiency.

Pruning power

The algorithms state expansion (SE) and prefix based depth-first search (PDFS) apply the same

pruning strategy, i.e., dominance based pruning as in Theorem 6, to reduce the potential state number

when computing the optimal trip. In this section, we investigate how powerful the pruning strategy

is. We take the SE algorithm as an example for illustration. Let Sall denote all the states that can

be enumerated using the POI set in each iteration, i.e., the enumeration space, and let Skept denote

the set of the states that are feasible (by checking whether the function Sat() returns true) and are

non-dominated. Explicitly, $S_{all}$ represents the states enumerated by Lines 3-5 in Algorithm 2, and $S_{kept}$ represents only the states that go deeper and pass the check of Line 9. Then, the pruning power, $\iota$, is defined as:

\[ \iota = \frac{|S_{all}| - |S_{kept}|}{|S_{all}|} \times 100\%. \tag{4.21} \]

Figure 4.4: Case study of recommended trips for LA, with the happiness of each trip in brackets: (a) PDFS (5.72); (b) HI (4.89).

Table 4.4 shows the pruning power for the three cities. In all of them, over 99% of the states are pruned by our SE/PDFS algorithms; in other words, only around 0.8% of the states are kept. Therefore, the search space and the runtime of searching for the optimal solution are greatly reduced. We can also observe that the pruning power becomes weaker as the time budget b increases. For instance, for the PX data set, ι = 99.16% when b = 9h, which is smaller than ι = 99.30% when b = 5h. The reason is that a longer time budget allows more POIs to be visited in a trip, so more trip routes become feasible and cannot be dominated.
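As a quick check of the "around 0.8%" figure, the fraction of kept states can be derived from the pruning power values in Table 4.4. The sketch below is illustrative; the function and variable names are assumptions, and the percentages are copied from the table.

```python
def pruning_power(n_all, n_kept):
    """Pruning power in percent, following Eq. (4.21)."""
    return (n_all - n_kept) / n_all * 100.0

# Pruning power (%) from Table 4.4 for b = 5h, 6h, 7h, 8h, 9h.
table_4_4 = {
    "PX": [99.30, 99.23, 99.20, 99.18, 99.16],
    "NY": [99.29, 99.26, 99.22, 99.21, 99.19],
    "LA": [99.25, 99.20, 99.17, 99.15, 99.11],
}

# The kept fraction is the complement of the pruning power.
kept = [100.0 - value for values in table_4_4.values() for value in values]
mean_kept = sum(kept) / len(kept)
print(f"kept states: min {min(kept):.2f}%, max {max(kept):.2f}%, mean {mean_kept:.2f}%")
```

The kept fraction stays below 1% in every setting and grows slowly with the time budget, matching the trend described above.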

Case study

For a randomly selected user with b = 8h, Figure 4.4 shows the trip routes designed by PDFS and HI on the local map of LA, where Location 1 is both the source and the destination. The visit follows the increasing order 1 → 2 → · · · → 1. PDFS and HI share many POIs in their recommended trip routes (e.g., POIs 2, 3, 6, 7 for PDFS) because both methods use the personalized preferences. However, HI maintains only one route and easily falls into a local optimum. For example, while POIs 1 and 3 are spatially far away from POIs 2 and 6 in Figure 4.4(b), HI visits these POIs in the order 1 → 2 → 3 → 6 → 1, in which every sub-route is between two POIs that are far apart; thus, too much time is spent on traveling. In contrast, PDFS designs the route as a circle, which reduces the number of sub-routes with long traveling time and allows the user to visit one more POI than HI within the same time budget.

In summary, PDFS finds the optimal solution with less runtime than SE; SE-SR offers a very good trade-off, gaining efficiency at a slightly lower happiness than the optimal solution; HI is very efficient but sometimes has a significantly lower happiness. Overall, PDFS and SE-SR are the two best performers considering both quality and efficiency.

[Figure 4.5 plots omitted. Panels: (a) PX, (b) NY, (c) LA. Left plots: happiness vs. time budget (5h to 9h). Right plots: runtime in seconds (logarithmic scale) vs. time budget (5h to 9h). Legend: SE-SR and PDFS, each with θ = 0.7, θ = 0.8, and θ = 0.9.]

Figure 4.5: The uncertain traveling time model: (left) happiness of trip routes found (y-axis) by SE-SR and PDFS vs. time budget b (x-axis); (right) average runtime (y-axis) vs. time budget b (x-axis).

4.7.4 The Uncertain Traveling Time Model

In this section, we study the effect of the uncertain traveling time on SE-SR and PDFS. The diversity threshold, β, is still set to 1. For the traveling time $t_{i,j}$ of a sub-route $i \to j$, we adopt the log-normal distribution $t_{i,j} \sim \mathcal{LN}(\mu_{ij}, \sigma_{ij}^2)$ from Section 4.3.3. Note that $E[t_{i,j}] = \exp(\mu_{ij} + \sigma_{ij}^2/2)$. To introduce the uncertainty, $\sigma_{ij}$ is generated from a uniform distribution, i.e., $\sigma_{ij} \sim U(0.5, 2)$.
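Given an expected traveling time and a sampled $\sigma_{ij}$, the identity $E[t_{i,j}] = \exp(\mu_{ij} + \sigma_{ij}^2/2)$ can be inverted to obtain $\mu_{ij} = \ln E[t_{i,j}] - \sigma_{ij}^2/2$. The sketch below applies this and then estimates a completion probability by Monte Carlo sampling. The sampling-based estimate is a stand-in chosen for illustration, not the computation of Section 4.3.3; it also ignores opening hours, and all names and numbers are hypothetical.

```python
import math
import random

def lognormal_mu(expected_time, sigma):
    """Solve E[t] = exp(mu + sigma^2 / 2) for mu."""
    return math.log(expected_time) - sigma ** 2 / 2.0

def completion_probability(expected_legs, sigmas, stay_total, budget,
                           n_samples=20000, seed=0):
    """Estimate P(total trip time <= budget) by sampling log-normal leg times."""
    rng = random.Random(seed)
    params = [(lognormal_mu(e, s), s) for e, s in zip(expected_legs, sigmas)]
    hits = 0
    for _ in range(n_samples):
        total = stay_total + sum(rng.lognormvariate(mu, s) for mu, s in params)
        if total <= budget:
            hits += 1
    return hits / n_samples

# Hypothetical route: three legs (hours), two POIs with 1 hour of touring each.
expected_legs = [0.5, 0.75, 0.5]
sigmas = [0.5, 0.5, 0.5]

p5 = completion_probability(expected_legs, sigmas, stay_total=2.0, budget=5.0)
p6 = completion_probability(expected_legs, sigmas, stay_total=2.0, budget=6.0)
print(f"completion probability: b=5h -> {p5:.3f}, b=6h -> {p6:.3f}")
```

A route would be kept only if its estimated probability reaches the threshold θ, so a larger σ or a higher θ leaves fewer feasible routes for the same time budget.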


Table 4.5: Comparison of the averaged happiness over the 100 testing users for the case of β = 1 and β > 1, gained from the optimal routes recommended by PDFS. The numbers in brackets indicate how many of the 100 optimal results generated with the constraint β = 1 for the testing users are also optimal with the constraint β equal to 2, 3 or 4.

| b | PX β = 1 | PX β = 2 | NY β = 1 | NY β = 2 | LA β = 1 | LA β = 2 |
|---|---|---|---|---|---|---|
| 5h | 11.63 (100) | 11.63 | 2.98 (99) | 2.98 | 2.50 (55) | 2.49 |
| 6h | 12.96 (100) | 12.96 | 3.81 (99) | 3.81 | 3.26 (97) | 3.25 |
| 7h | 16.35 (73) | 16.32 | 4.37 (96) | 4.35 | 3.97 (97) | 3.96 |
| 8h | 19.39 (99) | 19.38 | 4.99 (99) | 4.99 | 4.66 (100) | 4.66 |
| 9h | 20.36 (100) | 20.36 | 5.73 (100) | 5.73 | 5.26 (100) | 5.26 |

| b | PX β = 1 | PX β = 3 | NY β = 1 | NY β = 3 | LA β = 1 | LA β = 3 |
|---|---|---|---|---|---|---|
| 5h | 11.63 (38) | 11.61 | 2.98 (8) | 2.91 | 2.50 (11) | 2.47 |
| 6h | 12.96 (26) | 12.93 | 3.81 (71) | 3.79 | 3.26 (4) | 3.19 |
| 7h | 16.35 (38) | 16.30 | 4.37 (84) | 4.34 | 3.97 (5) | 3.92 |
| 8h | 19.39 (97) | 19.38 | 4.99 (6) | 4.94 | 4.66 (13) | 4.62 |
| 9h | 20.36 (89) | 20.35 | 5.73 (78) | 5.72 | 5.26 (96) | 5.23 |

| b | PX β = 1 | PX β = 4 | NY β = 1 | NY β = 4 | LA β = 1 | LA β = 4 |
|---|---|---|---|---|---|---|
| 5h | 11.63 (0) | - | 2.98 (0) | - | 2.50 (0) | - |
| 6h | 12.96 (4) | 12.17 | 3.81 (0) | 3.77 | 3.26 (1) | 3.19 |
| 7h | 16.35 (0) | 15.51 | 4.37 (3) | 3.91 | 3.97 (1) | 3.91 |
| 8h | 19.39 (40) | 19.37 | 4.99 (0) | 4.83 | 4.66 (4) | 4.59 |
| 9h | 20.36 (19) | 20.33 | 5.73 (23) | 5.69 | 5.26 (38) | 5.19 |

Figure 4.5 (left column) presents the happiness scores of SE-SR and PDFS with various thresholds θ on the completion probability. A color represents a method and a pattern represents a threshold θ. Compared to the case of the fixed traveling time in Section 4.7.3, the happiness of both methods becomes lower given the same time budget. For example, there is about a 20%-40% reduction in happiness from the fixed time cases at θ = 0.9 and b = 5h. This is because the route designed in the previous section, although having a higher happiness, may violate the completion probability constraint due to the variance of the traveling time, and a stricter constraint (i.e., a higher threshold) results in less happiness. In practice, if the user prefers a more reliable trip, a route with a higher completion probability but slightly less happiness is acceptable.

The uncertain traveling time model also reduces the runtime of both methods, as shown in Figure 4.5 (right column). As the completion probability threshold θ increases, there are fewer feasible routes, and both methods prune the routes whose completion probability falls below the threshold earlier. A cross examination with Figure 4.3 indicates that at θ = 0.7, the runtime of these methods under the uncertain traveling time model is close to that under the fixed traveling time model. At θ = 0.9, however, the runtime is almost an order of magnitude lower.


4.7.5 Effect of Diversity Constraint

The previous sections evaluated the methods for TripRec in the special case of β = 1, where the diversity constraint β specifies the minimum number of categories of POIs in a trip. In this section, we study the effect of the diversity constraint by comparing the results for the special case of β = 1 and the more general case of β > 1. The fixed traveling time model is applied. We consider the results of PDFS only because SE generates the same results. Greedy, DP, and HI cannot deal with the diversity constraint. As there are a total of 5 different categories of POIs, we vary β from 2 to 4.

Table 4.5 shows the averaged happiness values of PDFS for both the case of β = 1 and β > 1. The numbers in brackets indicate, among the recommended routes for the 100 testing users with the setting β = 1, how many routes also satisfy the specified diversity constraint with β equal to 2, 3 or 4. Note that if a route is optimal with the setting β = 1 but actually contains k (k > 1) categories, then the route is also optimal for the case of β = k. For example, for PX at b = 7h, the entry 16.35 (73) in the β = 2 comparison means that the averaged happiness of the 100 routes that have at least 1 category (i.e., β = 1) is 16.35, and out of them, 73 consist of 2 or more categories of POIs. As we can see in Table 4.5, the routes for β = 1 and β = 2 overlap heavily, since a route has a high chance of containing 2 categories of POIs. As β increases, fewer routes overlap, so a larger gap in happiness can be observed. At b = 5h, no route satisfies the diversity constraint of β = 4 since at most 3 POIs can be visited within 5 hours.
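The diversity check used in this comparison amounts to counting the distinct categories on a route. The following sketch shows the idea; the function name, the category labels, and the sample routes are hypothetical, and each POI is assumed to belong to exactly one category, as in this chapter.

```python
def satisfies_diversity(route, category_of, beta):
    """True if the POIs on the route cover at least beta distinct categories."""
    return len({category_of[poi] for poi in route}) >= beta

# Hypothetical POIs and categories (not taken from the data sets).
category_of = {
    "p1": "food", "p2": "museum", "p3": "park",
    "p4": "food", "p5": "shopping",
}

three_poi_route = ["p1", "p2", "p3"]
repeated_category_route = ["p1", "p4", "p2"]

print(satisfies_diversity(three_poi_route, category_of, beta=3))
print(satisfies_diversity(repeated_category_route, category_of, beta=3))
print(satisfies_diversity(three_poi_route, category_of, beta=4))
```

The last call illustrates the b = 5h case: a route with at most 3 single-category POIs can never cover 4 categories, so no feasible route exists for β = 4.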

4.8 Summary and Extensions

In this chapter, we formulated the personalized trip recommendation problem, which is NP-hard, to

retrieve a sequence of POIs that maximizes user’s satisfaction according to user’s historic activities

with various constraints including user’s time budget, POI availability and diversity, and uncertain

traveling time. We presented both optimal solutions and heuristic solutions to this problem. Our

evaluation on real life data sets suggested that PDFS is the most efficient algorithm for optimal

solutions and SE-SR improves efficiency at a slightly lower quality than optimal solutions.

Several variations are possible in the presented trip recommendation model. One variation is to

factor the touring time of a POI in the happiness score, that is, it is more important for a POI with a

longer staying time to be preferred by the user than a POI with a shorter staying time. We can also

factor the completion probability of a trip in the score, in addition to a threshold on the probability.

Another variation is adding a financial budget constraint of a user, in addition to the time budget,

assuming a cost for traveling and a cost for visiting a POI. Besides, the opening/closing hours can

also depend on the day of the week. This can be done by simply using the opening/closing hours at

the time when the POI is visited. In addition to the uncertainty of the traveling time, we can also model the uncertainty of the POI touring time via a separate random variable, in the same way as we did for the traveling time ti,j, and then merge the two uncertainties to estimate the completion probability of the whole trip. These variations or extensions require only minor modifications to our current algorithms.


Chapter 5

Route Search with Personalized Diversity Requirement on POIs

5.1 Motivations and Contributions

The personalized trip recommendation problem proposed in Chapter 4 deals with some realistic constraints, such as the budget, POI availability, and uncertain traveling time. However, it has several limitations. First, it infers a user’s personalized rating for new POIs from the user’s historical ratings for other POIs. It therefore cannot deal with the cold-start problem, where a user’s historical data is very sparse or even unavailable, or with the case where a user’s tastes or preferences change dynamically over time. Second, it assumes that each POI has a single type and that a user can specify the minimum number of POI types the trip route has to cover, but not exactly which types. As a result, the recommended trip may include types of POIs not desired by the user on that day. Third, the happiness function only values “quantity” and simply adds up the ratings of the POIs on a trip, even though many POIs may belong to the same type. In reality, people’s interest may decline as they consume more of the same type of product.

A practical problem that has not been well studied is the following: a user wants to be recommended a small number of routes that not only satisfy her cost budget and spatial constraints, but also best meet her personalized requirements on diverse POI features. We instantiate this problem with a travel scenario. Consider a new visitor to Rome who wishes to be recommended a trip, starting from her hotel and ending at the airport, that allows her to visit museums and souvenir shops and to eat at some good Italian restaurants (not necessarily in this order) in the remaining hours before taking her flight. She values variety over the number of places visited, e.g., a route consisting of one museum, one shop, and one Italian restaurant is preferred to a route consisting of two museums and two shops.

The above problem generalizes to various route planning scenarios, which share some common structures and requirements. First, there is a POI map where POIs are connected by edges with traveling costs between POIs, and each POI has a location and is associated with a vector of features (e.g., museum) with numeric or binary ratings. The POI map can be created from Google Maps, and the features and ratings of POIs can be created from user ratings and text tips available on location-based services such as Foursquare, or extracted from check-ins and user-provided reviews [28]. Second, the user seeks the top-k routes $P_1, \cdots, P_k$, from a specified source $x$ to a specified destination $y$ within a travel cost budget $b$, that have the highest values of a certain gain function $Gain(P_{iV})$ for the set of POIs $P_{iV}$ on the route $P_i$. The user specifies her preference of routes through a weight vector $w$, with $w_h$ being the weight of a feature $h$, and a route diversity requirement, which specifies a trade-off between quantity (the number of POIs with a preferred feature) and variety (the coverage of preferred features) for the POIs on a route. The gain function has the form $Gain(P_V) = \sum_h w_h \Phi_h(P_V)$, where $\Phi_h$ for each feature $h$ aggregates the feature’s scores of the POIs $P_V$.

[Figure 5.1 image omitted: a graph of six POIs, v1 to v6, each annotated with a feature vector, with a traveling cost on each edge.]

Figure 5.1: A sample POI map. Each node vi represents a POI with 3 features (Park, Museum, Restaurant). Each feature has a numeric rating in the range [0, 1], indicated by the vector beside the POI. Each edge has an associated cost of traveling the edge.

To better motivate the route diversity requirement, let us consider the sample POI map in Figure 5.1 and a user with the source and destination (x = v1, y = v5) and the travel cost budget b = 18. The user weights the features Park and Museum using the vector w = (0.4, 0.6, 0) for Park, Museum, and Restaurant in that order, and values both quantity and variety. If the sum aggregation is used for Φh, the route v1 → v6 → v4 → v5 has the highest Gain. However, this route may not be preferred by the user because it does not include any park, though it includes 3 museums. If the max aggregation is used, the route v1 → v3 → v5 has the highest Gain by including one top scored museum and one top scored park, but this route does not make full use of the available budget. Intuitively, the sum aggregation is “quantity minded” but ignores variety, whereas the max aggregation is the opposite; neither models the trade-off between quantity and variety that the user has in mind. This user would prefer the route v1 → v2 → v3 → v5, which visits multiple highly scored museums and parks and thus better addresses both quantity and variety.
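The contrast between the sum and max aggregations can be reproduced on a small example. The sketch below evaluates $Gain(P_V) = \sum_h w_h \Phi_h(P_V)$ for two hypothetical POI sets; the feature vectors are invented for illustration and are not the ones in Figure 5.1, and the function name is an assumption.

```python
def gain(pois, weights, aggregate):
    """Weighted gain: sum over features h of w_h * aggregate(ratings of feature h)."""
    return sum(
        weights[h] * aggregate([features[h] for features in pois])
        for h in range(len(weights))
    )

# Feature order: (Park, Museum, Restaurant); weights as in the example above.
weights = (0.4, 0.6, 0.0)

# Hypothetical POI sets: three museums vs. one park plus one museum.
museums_only = [(0.0, 1.0, 0.0), (0.0, 0.9, 0.0), (0.0, 0.8, 0.0)]
park_and_museum = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

for name, aggregate in (("sum", sum), ("max", max)):
    print(name,
          round(gain(museums_only, weights, aggregate), 2),
          round(gain(park_and_museum, weights, aggregate), 2))
```

Under the sum aggregation the museum-heavy set scores higher, while under the max aggregation the mixed set wins, which mirrors the quantity-versus-variety tension described above.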

Solving the top-k route search problem faces two challenges.

Challenge I. One challenge is to design a general enough Φh that includes a large class of aggregation functions to model a personalized route diversity requirement, where each user has her own trade-off between quantity and variety. In particular, if we treat the satisfaction gained by visiting each POI as the marginal utility, then for many variety minded users, the marginal gain decreases gradually as they visit more POIs of the same feature, i.e., diminishing marginal utility; moreover, the speed of this decrease varies across users. Some other quantity minded users, in contrast, only want to visit as many “good quality” POIs as possible.

Challenge II. The top-k route problem is NP-hard as it subsumes the NP-hard orienteering problem [17]. However, users typically demand that the routes not only be of high quality, or even optimal, but also be generated in a timely manner (seconds to minutes). The demand for both quality and efficiency makes the design of the algorithm quite challenging. This task is further complicated by the incorporation of a generic aggregation function Φh to model a personalized route diversity requirement, as motivated above.

To deal with Challenge I, our approach is to model the aggregation of the utilities of POIs with the diminishing return property by submodular set functions Φh. Submodularity has been used for modeling user behaviors in many real world problems [50][45]. To the best of our knowledge, modeling a user’s diversity requirement on a route by submodularity has not been done previously. For Challenge II, fortunately, the users’ preferences and constraints on desired routes provide new opportunities to reduce the search space of finding the optimal top-k routes. For example, for a user with only a 6-hour time budget who prefers museums and parks on a route, all the POIs of other types or beyond the 6-hour limit are irrelevant. The key to an exact algorithm is to prune, as early as possible, such irrelevant POIs as well as the routes that are unlikely to make it into the top-k list due to a low gain Gain(PV). Therefore, in addition to an efficient search strategy and a cost-based pruning, we design a tight upper bounding strategy on Gain(PV) that works for any submodular aggregation function Φh for pruning unpromising routes. We combine these techniques to search for optimal routes with fast responses.
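To make the diminishing return property concrete, the sketch below uses a rank-discounted power-law aggregation, in which the k-th best rating on a feature is weighted by $k^{-\alpha}$. This particular form is an assumption made here for illustration; the family actually used is defined in Section 5.2. With $\alpha = 0$ it reduces to the sum aggregation, and for large $\alpha$ it approaches the max aggregation.

```python
def power_law_aggregate(ratings, alpha):
    """Sum of ratings sorted in decreasing order, the k-th one discounted by k^(-alpha)."""
    ranked = sorted(ratings, reverse=True)
    return sum(r * k ** (-alpha) for k, r in enumerate(ranked, start=1))

def marginal_gain(ratings, new_rating, alpha):
    """Increase in the aggregate when one more POI rating is added to the set."""
    return (power_law_aggregate(ratings + [new_rating], alpha)
            - power_law_aggregate(ratings, alpha))

# Hypothetical museum ratings already on a route, and a candidate museum.
small_set = [0.9]
large_set = [0.9, 0.8]
candidate = 0.7

gain_small = marginal_gain(small_set, candidate, alpha=1.0)
gain_large = marginal_gain(large_set, candidate, alpha=1.0)
print(f"marginal gain: {gain_small:.3f} (one museum so far), "
      f"{gain_large:.3f} (two museums so far)")
```

The same museum adds less when more museums are already on the route, which is the spot check of diminishing marginal utility that a submodular Φh is meant to capture. A larger α models a more variety minded user.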

While we presented a detailed review of related work in Section 2.2, [120] is perhaps the most related one. It adopts a keyword coverage function to measure the degree to which a set of query keywords is covered by a route, similar to ours. Its pruning strategies are designed specifically for that keyword coverage function and thus cannot address the personalized route diversity requirement considered in our work, where a different submodular function may be required for each user. Our algorithm framework and pruning strategies apply to any submodular function Φh. Finally, [120] produces a single route, and its performance is only “2-3 times faster than the brute-force algorithm”, as pointed out in [120].

Contributions

The main contributions of this work are:

• We define the top-k route search problem with a new personalized route diversity require-

ment where the user can choose any submodular function Φh to model her desired level of

diminishing return. As an instantiation, we show that the family of power-law functions is

a sub-family of submodular functions and can model a spectrum of personalized diversity

requirement. (Section 5.2)


• Our first step towards an efficient solution is to eliminate irrelevant POIs for a query, by

proposing a novel structure for indexing the POI map on both features and travel costs. This

index reduces the POI map to a small set of POIs for a query. (Section 5.3)

• Our second step towards an efficient solution is to prune unpromising routes, by proposing a novel optimal algorithm, PACER. The novelties of the algorithm include an elegant route enumeration strategy for a compact representation of the search space and the reuse of computed results, a cost-based pruning for eliminating non-optimal routes, and a gain-based upper bound strategy for pruning routes that cannot make it into the top-k list. The algorithm works for any submodular function Φh. (Section 5.4)

• While PACER is guaranteed to find the exact solution, it could become inefficient for a query

with a loose constraint. To deal with such loose queries, we present two heuristic algorithms

with a good efficiency-accuracy trade-off, by finding a good solution with far smaller search

spaces. (Section 5.5)

• We evaluate our algorithms on real-world data sets. PACER provides optimal answers while

being orders of magnitude faster than the baselines. The heuristic algorithms provide answers

of a competitive quality and work efficiently for a larger POI map and/or a looser query

constraint. (Section 5.6)

5.2 Problem Statements

We formally define the problem studied in this work. Table 5.1 summarizes the notations frequently

used throughout the work. The variables in bold-face are vectors or matrices.

Definition 5. [A POI Map] A POI map $G = (V, E)$ is a directed or undirected connected graph, where $V$ is a set of geo-tagged POI nodes and $E \subseteq V \times V$ is a set of edges $(i, j)$ between nodes $i, j \in V$. $H$ is a set of features on POIs. $F \in \mathbb{R}^{|V| \times |H|}$ denotes the POI-feature matrix, where $F_{i,h} \in [0, \beta]$ is the rating on a feature $h$ for the POI $i$. Each POI $i \in V$ is associated with a staying cost $m_i$. Each edge $e_{i,j} \in E$ has a travel cost $t_{i,j}$.

Note that the POI map defined here differs from the POI graph introduced in Section 4.2 in the last work. There, an edge ei,j is an artificial edge representing the pre-computed shortest route from i to j, and the POI graph is a complete graph. The POI map defined in this work is a more natural road network in which an edge ei,j represents the real road connecting two adjacent POIs. The cost ti,j of the edge ei,j is given, but the shortest route from i to j and its cost Ti,j, defined shortly, are not given. In addition, the choices of mi and ti,j depend on the application and can be time, expenses, or other costs.

Table 5.1: Nomenclature

| Notation | Interpretation |
|---|---|
| $F \in \mathbb{R}^{\lvert V \rvert \times \lvert H \rvert}$ | POI-feature matrix $F$ with POI set $V$ and feature set $H$ |
| $F_{i,h}$ | the rating on feature $h \in H$ for POI $i \in V$ |
| $m_i$ | staying cost on POI $i$ |
| $t_{i,j}$ | the traveling cost on edge $e_{i,j} \in E$ |
| $T_{i,j}$ | the least traveling cost from any POI $i$ to any POI $j$ |
| $P$, $P_V$ | route $P$ with the included POI set $P_V$ |
| $Q = (x, y, b, \mathbf{w}, \boldsymbol{\theta}, \boldsymbol{\Phi})$ | user query with parameters: $x$ and $y$: source and destination location; $b$: travel cost budget; $\mathbf{w}$: feature preference vector; $\boldsymbol{\theta}$: filtering vector on feature ratings; $\boldsymbol{\Phi}$: feature aggregation functions |
| $V_Q$ | POI candidates set $V_Q$ retrieved by $Q$ |
| $\tilde{F}_{i,h}$ | $F_{i,h}$ after filtered by $\boldsymbol{\theta}$ |
| $Gain(P_V, Q)$ | gain of a route $P$ given query $Q$ |
| $topK$ | the found top-k routes |

Definition 6. [Routes] A route $P$ is a path $x \to \cdots i \cdots \to y$ in $G$ from the origin $x$ to the destination $y$ through a sequence of non-repeating POIs $i$ except possibly $x = y$. $P_V$ denotes the set of POIs on $P$. $T_{i,j}$ denotes the least traveling cost from $i$ to the next visited $j$, where $i, j$ are not necessarily adjacent in the POI map. The cost of $P$ is defined as

$$cost(P) = \sum_{i \in P_V} m_i + \sum_{i \to j \in P} T_{i,j}. \tag{5.1}$$

A route P includes only the POIs i that the user actually “visits” by staying at i with mi > 0.

Each i→ j on a route is a path from i to j with the least traveling cost Ti,j . The intermediate POIs

between i, j on path i → j are not included in P . The staying times at x and/or y can be either

considered or ignored depending on the user choice. The latter case can be modeled by setting

mx = my = 0.
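As a minimal sketch of Eqn. (5.1), the cost of a route can be computed from the staying costs and the least traveling costs between successive visits. The function name `route_cost`, the dictionaries `stay` and `least_cost`, and the toy numbers are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def route_cost(route: List[str],
               stay: Dict[str, float],
               least_cost: Dict[Tuple[str, str], float]) -> float:
    """cost(P): staying costs m_i over the visited POI set P_V plus the
    least traveling costs T_{i,j} between successively visited POIs."""
    visited = set(route)  # P_V; for a round trip (x = y) the POI x counts once
    staying = sum(stay[i] for i in visited)
    traveling = sum(least_cost[(i, j)] for i, j in zip(route, route[1:]))
    return staying + traveling


# Hypothetical toy data: x -> a -> y, with staying at x and y ignored
# by setting m_x = m_y = 0.
stay = {"x": 0.0, "a": 2.0, "y": 0.0}
least_cost = {("x", "a"): 3.0, ("a", "y"): 4.0}
total = route_cost(["x", "a", "y"], stay, least_cost)  # 2 + (3 + 4)
```

A route is then valid for a budget `b` when it starts at `x`, ends at `y`, and `route_cost(...) <= b`.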

At the minimum, the user has an origin x and a destination y for a route, not necessarily distinct, and a budget b on the cost of the route. In addition, the user may want the POIs to have certain features specified by a |H|-dimensional vector w with $w_h$ being the weight of feature h, where $1 \le h \le |H|$, $w_h \in [0, 1]$ and $\sum_h w_h = 1$. The user can also specify a filtering vector θ so that $F_{i,h}$ is set to 0 if it is less than $\theta_h$. $\tilde{F}_{i,h}$ denotes $F_{i,h}$ after this filtering. Finally, the user may specify a route diversity requirement through a feature aggregation function vector Φ, with $\Phi_h$ for each feature h. $\Phi_h(P_V)$ returns the aggregated rating on feature h over the POIs in $P_V$. See Section 5.2.2 for more details.

Definition 7. [Query and Gain] A user query Q is a 6-tuple (x, y, b, w, θ, Φ). A route P is valid if it starts from x and ends at y, and cost(P) ≤ b. The gain of P w.r.t. Q is

$$Gain(P_V, Q) = \sum_{h} w_h \Phi_h(P_V). \tag{5.2}$$


Note that only the specification of x, y, b is required; if the specification of w,θ,Φ is not pro-

vided by a user, their default choices can be used, or can be learned from users’ travel records if such

data are available (not the focus of this work). Gain(PV , Q) is a set function and all routes P that

differ only in the order of POIs have the same Gain, and the order of POIs affects only cost(P).

5.2.1 Top-k Route Problem

Problem 3. [Top-k route problem] Given a query Q and an integer k, the goal is to find k valid routes P that have different POI sets PV (among all routes having the same PV, we consider only the route with the smallest cost(P)) and the highest Gain(PV, Q) (ties are ranked by cost(P)). The k routes are denoted by topK.

In the rest of this work, we use Gain(PV ) for Gain(PV , Q).
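The selection rule of Problem 3 can be sketched as follows, given already enumerated valid routes. This is only an illustration of the ranking criteria, not the search algorithm of this work; the helper `top_k_routes` and the toy candidates are hypothetical, and ties on gain are assumed to be ranked by lower cost first.

```python
from typing import Dict, FrozenSet, List, Tuple

Route = Tuple[str, ...]
Candidate = Tuple[Route, float, float]  # (route, gain, cost)


def top_k_routes(candidates: List[Candidate], k: int) -> List[Candidate]:
    """Keep the cheapest route per POI set, then rank by gain (descending)
    and, as an assumed tie-breaker, by cost (ascending)."""
    best: Dict[FrozenSet[str], Candidate] = {}
    for route, gain, cost in candidates:
        key = frozenset(route)  # routes with the same P_V share one slot
        if key not in best or cost < best[key][2]:
            best[key] = (route, gain, cost)
    ranked = sorted(best.values(), key=lambda r: (-r[1], r[2]))
    return ranked[:k]


cands = [
    (("x", "a", "b", "y"), 5.0, 10.0),
    (("x", "b", "a", "y"), 5.0, 8.0),  # same POI set, cheaper visiting order
    (("x", "c", "y"), 5.0, 6.0),       # same gain, lower cost
    (("x", "a", "y"), 3.0, 4.0),
]
result = top_k_routes(cands, 2)
```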

5.2.2 Modeling Route Diversity Requirement

To address the personalized route diversity requirement, we consider a submodular $\Phi_h$ to model the diminishing marginal utility as more POIs with feature h are added to a route. A set function $f : 2^V \to \mathbb{R}$ is submodular if for every $X \subseteq Y \subseteq V$ and $v \in V \setminus Y$, $f(X \cup \{v\}) - f(X) \ge f(Y \cup \{v\}) - f(Y)$, and is monotone if for every $X \subseteq Y \subseteq V$, $f(X) \le f(Y)$. The next theorem follows from [49] and the fact that $Gain(P_V)$ is a nonnegative linear combination of $\Phi_h$.

Theorem 7. If for every feature h, $\Phi_h(P_V)$ is nonnegative, monotone and submodular, so is $Gain(P_V)$.

We aim to provide a general solution to the top-k route search problem for any nonnegative, monotone and submodular $\Phi_h$, which can model various personalized route diversity requirements. To illustrate the modeling power of such $\Phi_h$, consider $\Phi_h$ defined by the power-law function

$$\Phi_h(P_V) = \sum_{i \in P_V} R_h(i)^{-\alpha_h} F_{i,h}, \tag{5.3}$$

where $R_h(i)$ is the rank of POI i on the rating of feature h among all the POIs in $P_V$ (the largest value ranks first), rather than the order in which i is added to P, and $\alpha_h \in [0, +\infty)$ is the power-law exponent for feature h. $R_h(i)^{-\alpha_h}$ is non-increasing as $R_h(i)$ increases. For a sample route P = A(3) → B(5), with the rating of feature h for each POI in brackets, and $\alpha_h = 1$, the ranks of A and B on h are $R_h(A) = 2$ and $R_h(B) = 1$. Thus, $\Phi_h(P_V) = 2^{-1} \times 3 + 1^{-1} \times 5$, with a diminishing factor $2^{-1}$ for the second-ranked A. If we use a larger $\alpha_h = 2$, $\Phi_h(P_V) = 2^{-2} \times 3 + 1^{-2} \times 5$, which diminishes the contribution of A more strongly.

Figure 5.2 shows how $R_h(i)^{-\alpha}$ varies as the rank increases for different α. In general, a larger $\alpha_h$ means that the ratings $F_{i,h}$ on the recurrent feature diminish faster, i.e., a faster diminishing marginal value on h. Note that the sum aggregation ($\alpha_h = 0$) and the max aggregation ($\alpha_h = \infty$) are special cases. Hence, Eqn. (5.3) supports a spectrum of diversity requirements through the setting of $\alpha_h$.
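The power-law aggregation of Eqn. (5.3) and the gain of Eqn. (5.2) can be sketched as below, reproducing the A(3) → B(5) example. The function names `power_law_phi` and `gain` and the dictionary-based inputs are illustrative assumptions; ties between equal ratings are assumed to receive consecutive ranks.

```python
from typing import Dict, Iterable, List


def power_law_phi(ratings: Iterable[float], alpha: float) -> float:
    """Phi_h(P_V) = sum_i R_h(i)^(-alpha) * F_{i,h}, where rank 1 goes to the
    highest rating. Ranks are recomputed from the whole POI set each time."""
    ordered = sorted((r for r in ratings if r > 0), reverse=True)
    return sum(r * rank ** (-alpha) for rank, r in enumerate(ordered, start=1))


def gain(ratings_by_feature: Dict[str, List[float]],
         weights: Dict[str, float], alpha: float) -> float:
    """Gain(P_V, Q) = sum_h w_h * Phi_h(P_V), as in Eqn. (5.2)."""
    return sum(w * power_law_phi(ratings_by_feature.get(h, []), alpha)
               for h, w in weights.items())


phi_a1 = power_law_phi([3, 5], alpha=1)   # 5 / 1 + 3 / 2
phi_a2 = power_law_phi([3, 5], alpha=2)   # 5 / 1 + 3 / 4
phi_sum = power_law_phi([3, 5], alpha=0)  # plain sum aggregation
```

Because the ranks are recomputed from the full rating list, the value does not depend on the order in which POIs were added to the route.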

Figure 5.2: $R_h(i)^{-\alpha}$ vs. rank $R_h(i)$ for varied α (curves for α = 0, 0.5, 1, 2 and 100; ranks 1 to 8 on the horizontal axis, values from 0 to 1 on the vertical axis)

Note that the rank of an existing POI i may drop (i.e., $R_h(i)$ may increase) when a new POI j is added to P, so it is incorrect to compute the new $\Phi_h(P_V)$ by simply adding the marginal contribution of j to the existing value of $\Phi_h(P_V)$. For ease of presentation, we assume $\alpha_h$ has the same value for all h and use α for $\alpha_h$ in the rest of the work.

Theorem 8. Φh(PV ) defined in Eqn. (5.3) is nonnegative, monotone and submodular.

Proof. The nonnegativity and monotonicity of Φh(PV ) in Eqn. (5.3) is straightforward. We only

show its submodularity.

Let X and Y denote the sets of POIs contained in two routes with $X \subseteq Y$. The POIs in X and Y are arranged in descending order of $F_{i,h}$. Consider $v \in V \setminus Y$ so that $X' = X \cup \{v\}$ and $Y' = Y \cup \{v\}$. Let $\Delta X = \Phi_h(X') - \Phi_h(X)$ and $\Delta Y = \Phi_h(Y') - \Phi_h(Y)$. It suffices to show $\Delta X \ge \Delta Y$.

We assume $X \subset Y$, since if X = Y, the proof is straightforward. We also assume that Y contains exactly one more POI than X, say y. The general case of l > 1 extra POIs can be proved by repeating the argument for the assumed case l times. Besides, we assume v has feature h ($F_{v,h} \ne 0$); otherwise, $\Delta X \ge \Delta Y$ is always true.

For every POI i ∈ X or i ∈ Y , if 0 < Fi,h < Fv,h (ranked behind v), its rank drops by one after

inserting v. The ranks of other existing POIs remain unchanged. Thus, ∆X or ∆Y consists of two

parts, i.e., the increment by v’s insertion, and the decrement by the rank drop of the POIs behind v.

As Y includes one extra POI y compared to X , v’s rank in X ′ and Y ′ can only have two cases.

Case i: If v has the same ranks in X ′ and Y ′, y must be ranked lower than v in Y ′. Then, the

increment is the same in X ′ and Y ′ but the decrement in Y ′ is larger due to the extra y. Hence,

∆X ≥ ∆Y .

Case ii: If v has different ranks in X′ and Y′, y must be ranked ahead of v in Y′, and the POIs behind v in X′ and Y′ are identical. Let $X_j$ denote the POI whose rank (position) on feature h is j in X, and similarly for $X'_j$, $Y_j$ and $Y'_j$; with a slight abuse of notation, the same symbol also stands for the rating of that POI on h. Assume $X'_p = v$; then $Y'_{p+1} = v$. Besides, for any $q > p$, $X'_q = X_{q-1} = Y'_{q+1} = Y_q$. We define the increment on position p for X as $\Delta X(p)$; then

$$\Delta X(p) = p^{-\alpha}(X'_p - X_p) = p^{-\alpha}(X'_p - X'_{p+1}).$$

Similarly,

$$\Delta Y(p+1) = (p+1)^{-\alpha}(Y'_{p+1} - Y_{p+1}) = (p+1)^{-\alpha}(Y'_{p+1} - Y'_{p+2}).$$

It is easy to get $\Delta X(p) > \Delta Y(p+1)$. Similarly, for any $q > p$, $\Delta X(q) > \Delta Y(q+1)$. Hence, $\Delta X > \Delta Y$.

Figure 5.3: System Architecture. Offline: POI Map → Indexing (FI and HI). Online: query Q → POI Candidates Retrieval (retrieve sub-indices according to Q; select final POI candidates) → Routes Finding (search promising routes) → Top-k routes.

The user can also personalize her diversity requirement by specifying any other submodular Φh,

such as a log utility function $\Phi_h(P_V) = \log\left(1 + \sum_{i \in P_V} F_{i,h}\right)$ and the coverage function $\Phi_h(P_V) = 1 - \prod_{i \in P_V} [1 - F_{i,h}]$. Our approach is general in that it only depends on the submodularity of

Φh, but is independent of the exact choices of such functions. Our problem subsumes two NP-hard

problems, i.e., the submodular maximization problem [49] and the orienteering problem [17].
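The two alternative aggregation functions mentioned above can be written down directly, and their monotonicity and submodularity can be checked exhaustively on a tiny ground set. The sketch below is an illustration under assumptions: the names `log_utility`, `coverage` and `is_monotone_submodular` are hypothetical, the ratings are made up, and the coverage function is assumed to take ratings normalised to [0, 1].

```python
import math
from itertools import combinations
from typing import Callable, Dict, FrozenSet

SetFn = Callable[[FrozenSet[str]], float]


def log_utility(ratings: Dict[str, float]) -> SetFn:
    return lambda s: math.log(1 + sum(ratings[i] for i in s))


def coverage(ratings: Dict[str, float]) -> SetFn:
    # Assumes every rating lies in [0, 1].
    return lambda s: 1 - math.prod(1 - ratings[i] for i in s)


def is_monotone_submodular(f: SetFn, ground: FrozenSet[str],
                           tol: float = 1e-12) -> bool:
    """Brute-force check of f(X+v) - f(X) >= f(Y+v) - f(Y) >= 0
    for all X subset of Y and all v outside Y (small ground sets only)."""
    subsets = [frozenset(c) for r in range(len(ground) + 1)
               for c in combinations(ground, r)]
    for y in subsets:
        for x in subsets:
            if not x <= y:
                continue
            for v in ground - y:
                dx = f(x | {v}) - f(x)
                dy = f(y | {v}) - f(y)
                if dy < -tol or dx < dy - tol:
                    return False
    return True


ratings = {"a": 0.9, "b": 0.6, "c": 0.2, "d": 0.4}  # hypothetical ratings
ground = frozenset(ratings)
ok_log = is_monotone_submodular(log_utility(ratings), ground)
ok_cov = is_monotone_submodular(coverage(ratings), ground)
```

Such a check is exponential in the ground set size and is meant only as a sanity test for a candidate Φh, not as part of query processing.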

5.2.3 Framework Overview

To efficiently deal with the high computational complexity of this problem, we divide the overall

framework into the offline component and the online component, as shown in Figure 5.3. Before

processing any query, the offline component carefully indexes the POI map on feature and cost

dimensions for speeding up future POI selection and travel cost computation. The online component

responds to the user query Q with Sub-index Retrieval that extracts the sub-indices relevant to Q, and

Routes Search that finds the top-k routes using the sub-indices. For routes search, we consider both

the exact algorithm with novel pruning strategies, and heuristic algorithms to deal with the worst

case of less constrained Q. We first introduce an indexing strategy and the Sub-index Retrieval for

a given query in Section 5.3, and then consider routes search algorithms in Section 5.4 and Section

5.5.

5.3 Indexing

In this section, we explain the offline indexing component and the Sub-index Retrieval of the online

component.

Offline Index Building (stored on disk)

Feature Index (FI)
park: (v3, 1) (v2, 0.8)
museum: (v1, 1) (v4, 0.9) (v6, 0.4)
food: (v5, 0.9) (v6, 0.6) (v3, 0.2)

2-Hop Index (HI)
v1: (v1, 0) (v2, 12) (v6, 12) (v4, 14)
v2: (v2, 0) (v6, 4) (v3, 5)
v3: (v3, 0) (v4, 4) (v6, 4)
v4: (v4, 0) (v6, 3)
v5: (v5, 0) (v3, 1) (v4, 3) (v6, 5)
v6: (v6, 0)

Online Sub-index Retrieval (loaded to memory)

Q = (x = v6, y = v2, b = 13, w = (0.5, 0.5, 0), θ = 0.6, α = 1)

Query specific Feature Index (FIQ)
park: (v3, 1) (v2, 0.8)
museum: (v1, 1) (v4, 0.9)

Query specific 2-Hop Index (HIQ)
v1: (v1, 0) (v2, 12) (v6, 12)
v2: (v2, 0) (v6, 4) (v3, 5)
v3: (v3, 0) (v4, 4) (v6, 4)
v4: (v4, 0) (v6, 3)
v6: (v6, 0)

Figure 5.4: Left part: FI and HI built from the POI map in Figure 5.1. Right part: Given a query Q, retrieve POI candidates VQ by retrieving the sub-indices FIQ and HIQ from FI and HI.

5.3.1 Offline Index Building

The POI map data is stored on disk. To answer user queries rapidly with low I/O access and speed

up travel cost computation, we build two indices, FI and HI stored on disk.

FI is an inverted index mapping each feature h to a list of POIs having a non-zero rating on h. An entry (vi, Fi,h) indicates the feature rating Fi,h for POI vi, and each list is sorted in descending order of Fi,h. FI helps retrieve the POIs related to the features specified by a query.
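A minimal sketch of building such an inverted feature index is shown below. The function name `build_feature_index` and the nested-dictionary input format are assumptions; the ratings are the ones listed in the Feature Index of Figure 5.4.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_feature_index(ratings: Dict[str, Dict[str, float]]
                        ) -> Dict[str, List[Tuple[str, float]]]:
    """ratings[poi][feature] -> rating. Returns feature -> [(poi, rating)],
    sorted by rating in descending order and skipping zero ratings."""
    index: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for poi, feats in ratings.items():
        for feature, value in feats.items():
            if value > 0:
                index[feature].append((poi, value))
    for entries in index.values():
        entries.sort(key=lambda e: -e[1])
    return dict(index)


# Ratings read off the Feature Index (FI) in Figure 5.4.
ratings = {
    "v1": {"museum": 1.0},
    "v2": {"park": 0.8},
    "v3": {"park": 1.0, "food": 0.2},
    "v4": {"museum": 0.9},
    "v5": {"food": 0.9},
    "v6": {"museum": 0.4, "food": 0.6},
}
fi = build_feature_index(ratings)
```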

The least traveling cost Ti,j between two arbitrary POIs i and j is frequently required in the

online component. To compute Ti,j efficiently, we employ the 2-hop labeling [43] for point-to-point

shortest distance querying on weighted graphs. [43] reports scalable construction of 2-hop labels for both unweighted and weighted graphs, and the constructed labels answer shortest distance queries exactly. Our HI index is built using the 2-hop labeling method.

HI. For an undirected graph, there is one list of pivot labels for each node vi, where each label

(u, d) contains a pivot node u and the traveling cost d between vi and u. HI(vi) denotes the list

of labels for vi, sorted in the ascending order of d. According to [43], Ti,j between vi and vj is

computed by

$$T_{i,j} = \min_{(u, d_1) \in HI(v_i),\; (u, d_2) \in HI(v_j)} (d_1 + d_2). \tag{5.4}$$

Figure 5.4 (left part) shows the FI and HI for the POI map in Figure 5.1. For example, to compute

T2,5, we search for the common pivot nodes u from the pivot label lists of v2 and v5 and find that v3

minimizes the traveling cost between v2 and v5, so T2,5 = 5 + 1 = 6.
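The lookup in Eqn. (5.4) can be sketched with the pivot labels of Figure 5.4 for the undirected case. The dictionary layout and the helper name `least_cost` are assumptions; a real implementation would more likely merge the two sorted label lists than intersect dictionaries.

```python
from typing import Dict

Labels = Dict[str, Dict[str, float]]  # node -> {pivot node: traveling cost}

# Pivot labels copied from the 2-Hop Index (HI) in Figure 5.4.
HI: Labels = {
    "v1": {"v1": 0, "v2": 12, "v6": 12, "v4": 14},
    "v2": {"v2": 0, "v6": 4, "v3": 5},
    "v3": {"v3": 0, "v4": 4, "v6": 4},
    "v4": {"v4": 0, "v6": 3},
    "v5": {"v5": 0, "v3": 1, "v4": 3, "v6": 5},
    "v6": {"v6": 0},
}


def least_cost(hi: Labels, i: str, j: str) -> float:
    """T_{i,j}: minimum of d1 + d2 over the pivots shared by both label lists."""
    common = hi[i].keys() & hi[j].keys()
    if not common:
        return float("inf")  # assumed convention when no pivot is shared
    return min(hi[i][u] + hi[j][u] for u in common)


t25 = least_cost(HI, "v2", "v5")  # pivots v3 (5 + 1) and v6 (4 + 5)
```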


In the case of a directed graph, each POI $v_i$ will have two lists of labels in HI: $HI(v_i^{out})$ for $v_i$ as the source, and $HI(v_i^{in})$ for $v_i$ as the destination. We simply replace $v_i$ with $v_i^{out}$ and $v_j$ with $v_j^{in}$ in Eqn. (5.4) to compute $T_{i,j}$.

5.3.2 Online Sub-index Retrieval

Given a query Q, the first step is to retrieve the POI candidates VQ that are likely to be used in the routes search part. In particular, a POI that has none of the features preferred in w, or passes none of the thresholds in θ, will never be used, nor will a POI that cannot be visited on the way from the source x to the destination y within the budget b. This is implemented by retrieving the query specific sub-indices FIQ from FI and HIQ from HI.

Figure 5.4 (right part) shows how the retrieval works for a query Q = (x = v6, y = v2, b = 13, w = (0.5, 0.5, 0), θ = 0.6, α = 1), where the weights in w are for (Park, Museum, Food), and α is the power law exponent in Eqn. (5.3). Here the elements in each of the vectors θ and α have the same value for all features.

FIQ, a sub-index of FI, is retrieved using w and θ. w directly locates the lists for the user

preferred (with wh > 0) features. θ is used to cut off the lower-rated POIs on the sorted lists, as indicated by the red scissors. VQ = {v1, v2, v3, v4} contains the remaining POIs.

HIQ, a sub-index of HI, is then formed by retrieving the lists for each POI in VQ and also those for x and y, and b is used to cut off the sorted lists, as indicated by the red scissors. We also check whether a POI i in the current VQ is actually reachable by checking the single-point visit cost: if $m_x + T_{x,i} + m_i + T_{i,y} + m_y > b$, we remove i from VQ and remove its list from HIQ, as indicated by the blue shading. This yields the final POI candidates VQ. Typically, $|V_Q| \ll |V|$.

FIQ and HIQ are retrieved only once and kept in memory.
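The retrieval steps can be sketched end to end on the data of Figure 5.4. All function names here are hypothetical, a single threshold θ is used for every feature as in the example query, and the staying costs are assumed to be zero because the values of Figure 5.1 are not reproduced in this section; with other staying costs the final candidate set may differ.

```python
from typing import Dict, Iterable, List, Set, Tuple

FI = {"park": [("v3", 1.0), ("v2", 0.8)],
      "museum": [("v1", 1.0), ("v4", 0.9), ("v6", 0.4)],
      "food": [("v5", 0.9), ("v6", 0.6), ("v3", 0.2)]}
HI = {"v1": {"v1": 0, "v2": 12, "v6": 12, "v4": 14},
      "v2": {"v2": 0, "v6": 4, "v3": 5},
      "v3": {"v3": 0, "v4": 4, "v6": 4},
      "v4": {"v4": 0, "v6": 3},
      "v5": {"v5": 0, "v3": 1, "v4": 3, "v6": 5},
      "v6": {"v6": 0}}


def retrieve_fiq(fi: Dict[str, List[Tuple[str, float]]],
                 w: Dict[str, float], theta: float):
    """Keep lists of preferred features (w_h > 0), cut at the threshold."""
    fiq = {}
    for h, weight in w.items():
        if weight <= 0:
            continue
        kept = []
        for poi, rating in fi.get(h, []):
            if rating < theta:  # lists are sorted in descending order
                break
            kept.append((poi, rating))
        fiq[h] = kept
    return fiq


def retrieve_hiq(hi, pois: Iterable[str], b: float):
    """Keep the label lists of the given POIs, dropping labels with d > b."""
    return {p: {u: d for u, d in hi[p].items() if d <= b} for p in pois}


def least_cost(hiq, i: str, j: str) -> float:
    common = hiq[i].keys() & hiq[j].keys()
    return min((hiq[i][u] + hiq[j][u] for u in common), default=float("inf"))


def filter_reachable(hiq, cands: Set[str], x: str, y: str, b: float,
                     stay: Dict[str, float]) -> Set[str]:
    """Drop POIs whose single-point visit cost exceeds the budget b."""
    keep = set()
    for i in cands:
        if i in (x, y):  # assumption: endpoints are always kept
            keep.add(i)
            continue
        total = (stay[x] + least_cost(hiq, x, i) + stay[i]
                 + least_cost(hiq, i, y) + stay[y])
        if total <= b:
            keep.add(i)
    return keep


w = {"park": 0.5, "museum": 0.5, "food": 0.0}
fiq = retrieve_fiq(FI, w, theta=0.6)
vq = {poi for entries in fiq.values() for poi, _ in entries}
hiq = retrieve_hiq(HI, vq | {"v6", "v2"}, b=13)
stay = {p: 0.0 for p in HI}  # assumed staying costs, not taken from Figure 5.1
final_vq = filter_reachable(hiq, vq, "v6", "v2", 13, stay)
```

Under these assumptions v1 is dropped, since reaching it from v6 and continuing to v2 costs 12 + 12 = 24, which exceeds the budget of 13.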

5.4 Optimal Routes Search

With POI candidate set VQ and the sub-indices extracted, the next step is the Routes Search phase.

We present an optimal routes search algorithm in this section. Considering the complexity and gen-

erality of the problem, a standard tree search or a traditional algorithm for the orienteering problem

does not work. An ideal algorithm design should meet the following goals: i. search all promising

routes in a smart manner without any redundancy; ii. prune unpromising routes as aggressively as

possible while preserving the optimality of the top-k answers; iii. ensure that the search and pruning

strategies are applicable to any nonnegative, monotone and submodular aggregation functions Φh.

To this end, we propose a novel algorithm, Prefix bAsed Compact statEs gRowth (PACER), that

incorporates the idea of dynamic programming and fuses a cost-based pruning strategy and a gain-based pruning strategy in a unified way. Next, we present our enumeration and pruning strategies, followed by the algorithm and the complexity analysis.


5.4.1 Search Strategy

A route P is associated with several variables: $P_V$, $Gain(P_V)$, the ending POI end(P), and cost(P). If x is not visited, $m_x$ and $F_{x,h}$ for every h are set to 0; the same applies to y. A POI sequence is an open route if it starts from x and visits several POIs other than y; it is a closed route if it starts from x and ends at y. The initial open route includes only x. An open route P is feasible if its closed form P → y satisfies cost(P → y) ≤ b. In the following discussion, P denotes either an open route or a closed route. An open route $P^-$ with $end(P^-) = i$ can be extended into a longer open route $P = P^- \to j$ by a POI $j \notin P^-_V \cup \{y\}$. The variables for P are

$$\begin{cases} P_V = P^-_V \cup \{j\} \\ Gain(P_V) = \sum_h w_h \Phi_h(P_V) \\ end(P) = j \\ cost(P) = cost(P^-) + T_{i,j} + m_j. \end{cases} \tag{5.5}$$

Based on an intuition similar to that of Section 4.5.1 in our last work, $P_V$ and $Gain(P_V)$ depend on the POI set of the route P but are independent of how the POIs are ordered. Hence, we group all open routes sharing the same $P_V$ as a compact state C, and let $C_L$ denote the list of open routes having C as the POI set. C is associated with the following fields:

$$\begin{cases} Gain(C): \text{the gain of routes grouped by } C \\ C_L: \forall P \in C_L,\ end(P),\ cost(P). \end{cases} \tag{5.6}$$

This information is cached in a hash map with C as the key.
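The compact-state cache and the extension step of Eqn. (5.5) can be sketched as follows. The class `CompactState`, the function `extend`, the placeholder gain function and the toy costs are all illustrative assumptions; the point is only that two visiting orders of the same POI set land in one hash-map entry, so the gain is computed once.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple


class CompactState:
    def __init__(self, gain: float) -> None:
        self.gain = gain                            # Gain(C), shared by C_L
        self.routes: List[Tuple[str, float]] = []   # C_L as (end(P), cost(P))


def extend(states: Dict[FrozenSet[str], CompactState],
           prev_set: FrozenSet[str], end: str, cost: float, j: str,
           travel: Dict[Tuple[str, str], float], stay: Dict[str, float],
           gain_fn: Callable[[FrozenSet[str]], float]):
    """Append POI j to an open route with POI set prev_set ending at `end`."""
    new_set = prev_set | {j}
    if new_set not in states:  # the gain depends only on the POI set
        states[new_set] = CompactState(gain_fn(new_set))
    new_cost = cost + travel[(end, j)] + stay[j]
    states[new_set].routes.append((j, new_cost))
    return new_set, new_cost


# Hypothetical toy data.
travel = {("x", "a"): 2.0, ("x", "b"): 3.0, ("a", "b"): 1.0, ("b", "a"): 1.0}
stay = {"x": 0.0, "a": 1.0, "b": 1.0}
calls = []


def gain_fn(s: FrozenSet[str]) -> float:
    calls.append(s)
    return float(len(s))  # placeholder gain, not Eqn. (5.2)


root = frozenset({"x"})
states = {root: CompactState(0.0)}
sa, ca = extend(states, root, "x", 0.0, "a", travel, stay, gain_fn)
sb, cb = extend(states, root, "x", 0.0, "b", travel, stay, gain_fn)
sab, cab = extend(states, sa, "a", ca, "b", travel, stay, gain_fn)
sba, cba = extend(states, sb, "b", cb, "a", travel, stay, gain_fn)
```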

Besides, recall that the prefix based depth-first search method for enumerating the compact states proposed in the last work (see Section 4.5 and Figure 4.2 for details) has good properties: (1) its enumeration order ensures that when computing a route P, the feasible sub-routes of P have already been computed, which makes it possible to construct open routes incrementally; (2) it is a compact data model in which the information for the routes is stored as a hash map in memory without any redundancy. Therefore, we adapt the prefix based depth-first search method to the design of the optimal algorithm for finding the top-k routes in this work.

5.4.2 Cost-based Pruning Strategy

As is illustrated in Section 4.5, the prefix based depth-first search method incorporates the domi-

nance based pruning strategy as introduced in Section 4.4.1 in a natural way. As a result, we only

keep at most |C| dominating open routes at a compact state C, with each dominating open route

having a POI j ∈ C as the ending. Along with the enumeration of the routes, a closed route P → y

for each feasible and dominating open route P in a compact state C is used to update the top-k

routes topK. We call the adopted dominance based pruning Pruning-1: cost dominance pruning,

as the dominance is based on the travel cost.
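Pruning-1 can be illustrated with a small helper that keeps, among open routes over the same POI set, only the cheapest route per ending POI. This is a sketch of the idea as described above, with the dominance rule of Section 4.4.1 assumed to be "same POI set, same ending, lower cost"; the name `dominating_routes` and the toy routes are hypothetical.

```python
from typing import Dict, Iterable, Tuple

OpenRoute = Tuple[str, float, Tuple[str, ...]]  # (ending POI, cost, POI order)


def dominating_routes(routes: Iterable[OpenRoute]
                      ) -> Dict[str, Tuple[float, Tuple[str, ...]]]:
    """All routes must share one POI set (one compact state C).
    Returns at most |C| routes: the cheapest one for each ending POI."""
    best: Dict[str, Tuple[float, Tuple[str, ...]]] = {}
    for end, cost, order in routes:
        if end not in best or cost < best[end][0]:
            best[end] = (cost, order)
    return best


routes = [
    ("c", 9.0, ("x", "a", "b", "c")),
    ("c", 7.0, ("x", "b", "a", "c")),  # dominates the route above
    ("b", 8.0, ("x", "a", "c", "b")),
]
kept = dominating_routes(routes)
```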


Though all dominated open routes are pruned, many of the remaining dominating open routes are still unlikely to lead to the top-k closed routes. It would be better to prune these unpromising dominating open routes in advance, instead of extending them step by step only to find that the extended closed routes are far worse than the top-k routes. This motivates our next pruning strategy.

5.4.3 Gain-based Pruning Strategy

We can extend a dominating open route P step by step using the remaining budget $\Delta b = b - cost(P)$ into a closed route $P \to \hat{P}$. The POIs used for extension at each step should be reachable from the current end(P) and are therefore chosen from the set

$$U = \{\, i \mid T_{end(P),i} + m_i + T_{i,y} + m_y \le \Delta b \,\}, \tag{5.7}$$

where i is an unvisited POI other than y. $T_{end(P),i}$ and $T_{i,y}$ can be computed through HIQ. $P \to \hat{P}$ has gain $Gain(P_V \cup \hat{P}_V)$. Then the marginal gain by concatenating $\hat{P}$ to the existing P is

$$\Delta Gain(\hat{P}_V \mid P_V) = Gain(P_V \cup \hat{P}_V) - Gain(P_V). \tag{5.8}$$

Let $P \to \hat{P}^*$ denote the $P \to \hat{P}$ with the highest gain. If $P \to \hat{P}^*$ ranks lower than the current k-th top route topK[k], P is not promising and all the open routes extended from P can be pruned.
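The candidate set of Eqn. (5.7) amounts to a simple filter over the unvisited POIs. In the sketch below, `extension_candidates`, the callable `T` and the toy costs are assumptions introduced for illustration.

```python
from typing import Callable, Dict, Iterable, Set


def extension_candidates(end: str, y: str, unvisited: Iterable[str],
                         remaining: float, stay: Dict[str, float],
                         T: Callable[[str, str], float]) -> Set[str]:
    """U: unvisited POIs i (other than y) that can be visited from end(P)
    and still reach y within the remaining budget."""
    return {i for i in unvisited
            if i != y and T(end, i) + stay[i] + T(i, y) + stay[y] <= remaining}


# Hypothetical symmetric least traveling costs on a toy map.
dist = {frozenset(("e", "a")): 2.0, frozenset(("e", "b")): 6.0,
        frozenset(("a", "y")): 3.0, frozenset(("b", "y")): 5.0}


def T(i: str, j: str) -> float:
    return 0.0 if i == j else dist[frozenset((i, j))]


stay = {"a": 1.0, "b": 1.0, "y": 0.0}
U = extension_candidates("e", "y", ["a", "b", "y"], remaining=8.0,
                         stay=stay, T=T)
```

Here POI a costs 2 + 1 + 3 + 0 = 6 and stays in U, while POI b costs 6 + 1 + 5 + 0 = 12 and is excluded.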

Pruning-2: marginal gain upper bound pruning. However, finding $\hat{P}^*$ is as hard as finding an optimal route from scratch, so we seek to estimate an upper bound UP of the marginal gain $\Delta Gain(\hat{P}_V \mid P_V)$, such that if $Gain(P_V) + UP$ is less than the gain of topK[k], P is not promising; thus, P and all its extensions can be pruned without affecting the optimality. We call this marginal gain upper bound pruning. As more routes are enumerated, the gain of topK[k] increases and this pruning becomes more powerful.

The challenge of estimating UP is to estimate the cost of the extended part $\hat{P}$ without knowing the order of the POIs. Because $\Delta Gain(\hat{P}_V \mid P_V)$ is independent of the POIs’ order, we can ignore the order and approximate the “route cost” by a “set cost”, i.e., the sum of some cost c(i) of each POI $i \in \hat{P}_V$, where c(i) is no larger than i’s actual cost when it is included in $\hat{P}$. We define c(i) as:

$$c(i) = m_i + \min(t_{j,i})/2 + \min(t_{i,k})/2, \tag{5.9}$$

where $t_{j,i}$ is the cost on an in-edge $e_{j,i}$ and $t_{i,k}$ is the cost on an out-edge $e_{i,k}$. As the order of POIs is ignored, it is easy to verify that min ensures the above property of c(i). The destination y is “one-sided”, i.e., $c(y) = m_y + \min(t_{j,y})/2$. To make a tighter cost approximation, we also count the half out-edge cost $\min(t_{end(P),k})/2$ for end(P). (This “set cost” can further approach the “route cost” by replacing $t_{j,i}$ and $t_{i,k}$ with $T_{j,i}$ and $T_{i,k}$, respectively, and choosing j and k from U, but finding $\min(T_{j,i})$ and $\min(T_{i,k})$ incurs additional computation cost.)
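The set cost of Eqn. (5.9) can be sketched on a toy directed map to see that it never exceeds the actual cost of an extension. The function `set_cost`, the edge dictionary and all numbers are hypothetical.

```python
from typing import Dict, Tuple

Edges = Dict[Tuple[str, str], float]  # (from, to) -> edge cost t


def set_cost(i: str, stay: Dict[str, float], edges: Edges,
             is_destination: bool = False) -> float:
    """c(i): staying cost plus half the cheapest in-edge and, unless i is
    the destination y, half the cheapest out-edge."""
    cheapest_in = min(t for (a, b), t in edges.items() if b == i)
    cost = stay[i] + cheapest_in / 2
    if not is_destination:
        cheapest_out = min(t for (a, b), t in edges.items() if a == i)
        cost += cheapest_out / 2
    return cost


# Toy map with current end "e", candidate "a" and destination "y".
edges = {("e", "a"): 4.0, ("a", "e"): 4.0, ("a", "y"): 2.0,
         ("y", "a"): 2.0, ("e", "y"): 5.0, ("y", "e"): 5.0}
stay = {"e": 0.0, "a": 1.0, "y": 0.0}

c_a = set_cost("a", stay, edges)
c_y = set_cost("y", stay, edges, is_destination=True)
half_out_end = min(t for (a, b), t in edges.items() if a == "e") / 2
lower_bound = half_out_end + c_a + c_y          # set cost of e -> a -> y
actual = edges[("e", "a")] + stay["a"] + edges[("a", "y")] + stay["y"]
```

In this toy case the set cost is 6 while the actual cost of the extension is 7, so the approximation stays on the safe side.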


Then, UP is exactly the solution, i.e., the maximum $\Delta Gain(S^* \mid P_V)$, to the following optimization problem:

$$\max_{S \subseteq U \cup \{y\}} \Delta Gain(S \mid P_V) \quad \text{s.t.} \quad \sum_{i \in S} c(i) \le B, \tag{5.10}$$

where U is defined in Eqn. (5.7) and $B = \Delta b - \min(t_{end(P),k})/2$. Note that S should include y because $end(\hat{P}) = y$. As c(i) and c(end(P)) are no larger than their actual costs, $\Delta Gain(S^* \mid P_V) \ge \Delta Gain(\hat{P}_V \mid P_V)$ for any $\hat{P}$. Thus, using $\Delta Gain(S^* \mid P_V)$ as UP never loses the optimality. To solve Eqn. (5.10), we first show the properties of the marginal gain function ∆Gain.

Theorem 9. The marginal gain function ∆Gain as defined in Eqn. (5.8) is nonnegative, monotone

and submodular.

Proof. We only show that ∆Gain is submodular, as the proof of other properties is straightforward. According to [49], if a set function $g : 2^V \to \mathbb{R}$ is submodular, and $X, Y \subset V$ are disjoint, the residual function $f : 2^Y \to \mathbb{R}$ defined as $f(S) = g(X \cup S) - g(X)$ is also submodular. Since Gain is submodular (Theorem 7) and since $P_V, U \subset V$ are disjoint, $\Delta Gain(\hat{P}_V \mid P_V) = Gain(P_V \cup \hat{P}_V) - Gain(P_V)$ is residual on $P_V$ and thus is submodular.

Apparently, Eqn. (5.10) is a submodular maximization problem subject to a knapsack constraint,

which unfortunately is also NP-hard [49]. Computing ∆Gain(S∗|PV ) is costly, thus, we consider

estimating its upper bound.

One approach, according to [91], is to run the $\Omega(B|U|^4)$-time greedy algorithm in [46] (B is defined in Eqn. (5.10)) to obtain an approximate solution $\Delta Gain(S' \mid P_V)$ to the above problem with an approximation ratio of $1 - e^{-1}$; an upper bound of $\Delta Gain(S^* \mid P_V)$ is then given by $\Delta Gain(S' \mid P_V)/(1 - e^{-1})$. A less costly version of this algorithm runs in $O(B|U|)$ but its approximation ratio is $\frac{1}{2}(1 - e^{-1})$.

Compared with the above offline bounds, i.e., $1 - e^{-1}$ and $\frac{1}{2}(1 - e^{-1})$, which are stated in advance before running the actual algorithm, the next theorem states that we can instead use the submodularity to acquire a much tighter online bound.

Theorem 10. For each POI $i \in U \cup \{y\}$, let $\delta_i = \Delta Gain(i \mid P_V)$. Let $r_i = \delta_i / c(i)$, and let $i_1, \cdots, i_m$ be the sequence of these POIs with $r_i$ in decreasing order. Let l be such that $C = \sum_{j=1}^{l-1} c(i_j) \le B$ and $\sum_{j=1}^{l} c(i_j) > B$. Let $\lambda = (B - C)/c(i_l)$. Then

$$UP = \sum_{j=1}^{l-1} \delta_{i_j} + \lambda \delta_{i_l} \ge \Delta Gain(S^* \mid P_V). \tag{5.11}$$

Proof. [56] showed that, for an arbitrary given solution A (obtained using any algorithm) to a constrained submodular maximization problem, a tight online bound can be derived to measure how far A is from the optimal solution. Applying the result of [56] to the problem in Eqn. (5.10) with A = ∅ yields Theorem 10.

Algorithm 5: PACER(C−, I) (Recursive function)
Global Input: Q = (x, y, b, w, θ, Φ), VQ, FIQ and HIQ to compute Gain(C) and cost(P), and k
Parameters: compact state C− and the set of POIs I
Output: a priority queue topK

 1 forall POI i in set I in order do
 2     C ← {i} ∪ C−;
 3     compute Gain(C);
 4     forall POI j in C do
 5         C−j ← C \ {j};
 6         P− ← the dominating route in C−jL such that cost(P− → j) is minimum; // pruning-1
 7         P ← P− → j;
 8         if cost(P → y) ≤ b then
 9             compute UP using Eqn. (5.11);
10             if Gain(C) + UP ≥ Gain(topK[k]) then
11                 insert route P into CL; // pruning-2
12     UpdateTopK(CL, topK);
13     PACER(C, prefix of i in I);

By this means, UP is computed without running a greedy algorithm. We also verified empirically that the online bound in Eqn. (5.11) outperforms the offline bounds on both tightness and computational cost. Thus, we choose the online bound.
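To illustrate how the online bound of Theorem 10 can be evaluated, the following Python sketch sorts the candidate POIs by their gain/cost ratio and fills the budget B in that order, taking only a fractional share λ of the first POI that no longer fits. It is a minimal illustration rather than the implementation used in this thesis; the function name `online_upper_bound` and the dictionary-based inputs are assumptions made for this example, and all costs are assumed to be positive.

```python
def online_upper_bound(marginal_gains, costs, budget):
    """Sketch of the bound in Eqn. (5.11).

    marginal_gains[i] plays the role of delta_i, costs[i] of c(i) (assumed > 0),
    and budget of B. The names and input layout are illustrative assumptions.
    """
    # Sort POIs by r_i = delta_i / c(i) in decreasing order.
    order = sorted(marginal_gains,
                   key=lambda i: marginal_gains[i] / costs[i],
                   reverse=True)
    bound, used = 0.0, 0.0
    for i in order:
        if used + costs[i] <= budget:
            # POIs i_1 .. i_{l-1}: they fit entirely into the budget.
            bound += marginal_gains[i]
            used += costs[i]
        else:
            # POI i_l: only the fraction lambda = (B - C) / c(i_l) is counted.
            lam = (budget - used) / costs[i]
            bound += lam * marginal_gains[i]
            break
    return bound
```

For example, with marginal gains 6, 4, 3 and costs 3, 4, 6 under B = 5, the first POI fits completely and half of the second is counted, giving a bound of 6 + 0.5 × 4 = 8.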

5.4.4 Algorithm

Algorithm 5 describes our algorithm PACER, which incorporates the above enumeration and multiple pruning strategies. Given the global variables, PACER(C−, I) recursively enumerates the subtree at the current compact state C− with the POI set I available for extending C−, and finally returns the k best routes in topK. The initial call is PACER(∅, VQ), when only x is included.

As explained in Section 5.4.1, Lines 1 - 3 extend C− by each i in the set I in order, creating the child node C and computing Gain(C). Lines 4 - 11 generate the dominating and promising open routes CL. Specifically, for each j ∈ C selected as the ending POI, Lines 5 - 6 find the dominating route P− from the previously computed C−jL . This corresponds to Pruning-1. Only when the new open route P is feasible is Pruning-2 applied to check whether P has the potential to be extended to a closed route no worse than the current topK routes; if so, P is inserted into CL (Lines 8 - 11). After CL is finalized, the algorithm selects an open route P in CL such that P → y has the least cost to update topK (Line 12). The information of the new compact state C, as in Eqn. (5.6), is added to the hash map. At last, C is extended recursively with the POIs in the prefix of i in the current I (Line 13).

Let us summarize the good properties of PACER as follows.

Properties of PACER. (1) PACER works for any nonnegative, monotone, and submodular Gain function so as to deal with the personalized diversity requirement. (2) Open routes are enumerated as compact states in a prefix-first depth-first order, which enables constructing open routes incrementally, i.e., dynamic programming. (3) Armed with Pruning-1, we compute at most |C| dominating feasible open routes at each compact state C, instead of |C|! routes. (4) Pruning-2 further weeds out the dominating feasible open routes not having a promising estimated maximum gain. This pruning is tightened up as more closed routes are enumerated.
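The enumeration order and Pruning-1 can be sketched in a few lines of Python. The sketch below is a simplified illustration under stated assumptions, not the PACER implementation: it returns only the top-1 route, omits Pruning-2 and the indices, assumes that travel costs satisfy the triangle inequality (so an infeasible open route can never become feasible by adding POIs), and uses hypothetical names (`pacer_top1`, `travel`, `stay`, `gain`).

```python
def pacer_top1(pois, x, y, budget, travel, stay, gain):
    """Simplified top-1 search with compact states and cost dominance.

    travel[a][b]: travel cost (assumed to satisfy the triangle inequality).
    stay[i]: staying cost at POI i. gain: set function on frozensets of POIs.
    Returns (best gain, route as a tuple) or None if x -> y is infeasible.
    """
    # routes[state][end] = (cost, route): cheapest open route from x that
    # visits exactly the POIs in `state` and ends at `end`.
    routes = {frozenset(): {x: (0.0, (x,))}}
    best = (gain(frozenset()), (x, y)) if travel[x][y] <= budget else None

    def extend(state, candidates):
        nonlocal best
        for pos, i in enumerate(candidates):
            new_state = state | {i}
            open_routes = {}
            for j in new_state:
                sub = routes.get(new_state - {j})
                if not sub:
                    continue  # no feasible open route over new_state \ {j}
                # Pruning-1: keep only the cheapest route ending at j.
                cost, route = min(
                    (c + travel[end][j] + stay[j], r + (j,))
                    for end, (c, r) in sub.items()
                )
                if cost + travel[j][y] <= budget:
                    open_routes[j] = (cost, route)
            if not open_routes:
                continue  # nothing feasible here, so do not extend further
            routes[new_state] = open_routes
            value = gain(new_state)
            if best is None or value > best[0]:
                end = min(open_routes,
                          key=lambda j: open_routes[j][0] + travel[j][y])
                best = (value, open_routes[end][1] + (y,))
            # Prefix-first order: extend only with POIs that precede i.
            extend(new_state, candidates[:pos])

    extend(frozenset(), list(pois))
    return best
```

Because a compact state is extended only with the POIs preceding i, every subset C \ {j} needed in the inner loop has already been processed when C is reached, which is what allows the dominating routes to be built incrementally.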

5.4.5 Complexity Analysis

We measure the computational complexity by the number of routes examined. Two main factors affecting this measure are the size of the POI candidate set, i.e., |VQ| denoted by n, and the maximum length of routes examined (excluding x and y), i.e., the maximum |P| denoted by p, with p ≪ n. We analyze PACER relative to the brute-force search and a state-of-the-art approximation solution.

PACER. The compact states on the l-th level of the enumeration tree (Figure 4.2) compute the routes containing l POIs, thus, there are at most $\binom{n}{l}$ compact states on level l, and thanks to Pruning-1, each compact state represents at most l dominating open routes, each computed only once. There are n dominating open routes with a single POI on level l = 1. Starting from l = 2, to generate each dominating open route on level l, we need to examine (l − 1) sub-routes having the same set of POIs and add the same ending to determine which is the dominating one according to the cost dominance pruning strategy. Therefore, with p ≪ n and Pascal's rule [11], the number of routes examined is at most

$$n + \sum_{l=2}^{p} l(l-1)\binom{n}{l} = n + n(n-1)\sum_{l=2}^{p}\binom{n-2}{l-2} \approx n(n-1)\left(\binom{n-2}{p-2} + \binom{n-2}{p-3}\right) = n(n-1)\binom{n-1}{p-2} = \frac{n-1}{(n-p+1)(p-2)!} \cdot \frac{n!}{(n-p)!}. \qquad (5.12)$$

Therefore, the computation cost of PACER is $O\left(\frac{1}{(p-2)!} \cdot \frac{n!}{(n-p)!}\right)$ with p ≪ n. If Pruning-2 is also enabled and it prunes a fraction γ of the routes examined by PACER with Pruning-1, the computation cost of PACER is $O\left((1-\gamma)\frac{1}{(p-2)!} \cdot \frac{n!}{(n-p)!}\right)$.

Brute-force algorithm (BF). The brute-force algorithm based on the breadth-first expansion is a full permutation of p POIs chosen from the n candidate POIs, therefore, it examines $O\left(\frac{n!}{(n-p)!}\right)$ routes, which is (p − 2)! times that of PACER with only Pruning-1. In general, the next POI visited in a route does not have to be an immediate neighbor of the previous one.

Approximation algorithm (AP). [17] proposed a quasi-polynomial time approximation algorithm for the Orienteering Problem. We modified AP to solve our problem. It uses a recursive binary search to guess the middle node of a route and produces a single route with the approximation ratio $\lceil \log p \rceil + 1$ and runs in $O((n \cdot OPT \cdot \log b)^{\log p})$, where OPT and b are the numbers of discrete values

for an estimated optimal Gain and for the budget, respectively. The cost is extremely expensive if b or OPT has many discrete values, even worse than the brute-force algorithm. For example, for b = 512 minutes, n = 50, p = 8 and OPT = 10.0 (100 discrete values with single decimal point precision), the computation cost is $(50 \cdot 100 \cdot \log 512)^{\log 8} = 9.11 \times 10^{13}$. [86] noted that AP took more than $10^4$ seconds for a small graph with 22 nodes. Compared with AP, the computation cost of PACER with Pruning-1 given by Eqn. (5.12) is only $50 \times 49 \times \binom{49}{6} = 3.43 \times 10^{10}$. This cost is further reduced by Pruning-2. PACER finds the optimal top-k routes whereas AP only finds a single approximate solution. We will experimentally compare PACER with AP.
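The numbers in this comparison can be reproduced with a short Python sketch. The function names below are illustrative assumptions, and the logarithms are taken to base 2, which is the choice that reproduces the figure 9.11 × 10^13 quoted above.

```python
from math import comb, factorial, log2, perm


def pacer_routes(n, p):
    """Approximate route count of Eqn. (5.12): n(n-1) * C(n-1, p-2)."""
    return n * (n - 1) * comb(n - 1, p - 2)


def pacer_routes_exact_sum(n, p):
    """Left-hand side of Eqn. (5.12) before the p << n approximation."""
    return n + sum(l * (l - 1) * comb(n, l) for l in range(2, p + 1))


def brute_force_routes(n, p):
    """Number of ordered selections of p POIs out of n: n! / (n-p)!."""
    return perm(n, p)


def ap_cost(n, opt_values, budget_values, p):
    """Cost estimate (n * OPT * log b)^(log p) for AP, base-2 logs assumed."""
    return (n * opt_values * log2(budget_values)) ** log2(p)


if __name__ == "__main__":
    n, p = 50, 8
    print(f"PACER (Pruning-1): {pacer_routes(n, p):.3e}")
    print(f"Brute force:       {brute_force_routes(n, p):.3e}")
    print(f"AP:                {ap_cost(n, 100, 512, p):.3e}")
```

For n = 50 and p = 8, the closed form differs from the exact sum by under 2%, which suggests that dropping the lower-order binomial terms is harmless in this regime.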

5.5 Heuristic Methods

PACER remains expensive for a large cost budget b and a large POI candidate set VQ. The state-of-the-art approximation algorithm [17], as discussed above, is not scalable. Therefore, in this section, we design two heuristics for such extreme cases.

State collapse heuristic. The cost dominance pruning in PACER keeps at most l open routes

for a compact state representing a set of l POIs, excluding x and y. A more aggressive pruning

is to keep only one open route having the least cost at each compact state, with the heuristic that

this route likely visits more POIs. We denote this heuristic algorithm by PACER-SC, where SC

stands for “State Collapsing”. Clearly, PACER-SC trades optimality for efficiency, but it inherits

many nice properties from PACER and Section 5.6.2 will show that it usually produces k routes

with quite good quality.

Analogous to the complexity analysis for PACER in Section 5.4.5, PACER-SC examines no more than $\sum_{l=1}^{p} l\binom{n}{l} \approx n\binom{n}{p-1}$ routes, if p ≪ n. Thus, the computation cost of PACER-SC is around 1/p of that for PACER.
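The intermediate steps behind this estimate are not spelled out; one way to fill them in, sketched here under the same assumption p ≪ n, uses the identity $l\binom{n}{l} = n\binom{n-1}{l-1}$, keeps only the two largest terms of the sum, and applies Pascal's rule:

```latex
\sum_{l=1}^{p} l\binom{n}{l}
  = n\sum_{l=1}^{p}\binom{n-1}{l-1}
  \approx n\left(\binom{n-1}{p-1} + \binom{n-1}{p-2}\right)
  = n\binom{n}{p-1},
\qquad
\frac{n\binom{n}{p-1}}{n(n-1)\binom{n-1}{p-2}}
  = \frac{n}{(n-1)(p-1)} \approx \frac{1}{p-1}.
```

The ratio to the PACER count of Eqn. (5.12) is therefore of the order of 1/p, in line with the estimate above.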

Greedy algorithm. PACER-SC’s computation complexity remains exponential in the route

length p. Our next greedy algorithm runs in polynomial time. It starts with the initial route x → y

and iteratively inserts an unvisited POI i into the current route to maximize the marginal gain/cost ratio

$$\frac{Gain(\{i\} \cup C) - Gain(C)}{m_i + T_{x,i} + T_{i,y}}, \qquad (5.13)$$

where C denotes the set of POIs on the current route. It inserts i between two adjacent POIs in the

current route so that the total cost of the resulting route is minimized. The term Tx,i+Ti,y constrains

the selected POIs i to be those not too far away from the two end points. The expansion process

is repeated until the budget b is used up. The algorithm only produces a single route and examines

O(pn) routes because each insertion will consider at most n unvisited POIs.
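The greedy heuristic described above can be sketched as follows. This is an illustrative Python version, not the C++ implementation evaluated later; the names `greedy_route`, `travel`, `stay` and `gain` are assumptions, and one detail is our own choice: a candidate whose cheapest insertion would exceed the budget is skipped, and the loop stops once no candidate fits.

```python
def greedy_route(pois, x, y, budget, travel, stay, gain):
    """Greedy insertion by the marginal gain/cost ratio of Eqn. (5.13).

    travel[a][b] is the travel cost, stay[i] the staying cost m_i (assumed
    positive), gain a set function on frozensets. Returns (route, cost),
    or None when even the direct route x -> y exceeds the budget.
    """
    if travel[x][y] > budget:
        return None
    route, cost, chosen = [x, y], travel[x][y], frozenset()
    while True:
        best = None
        for i in pois:
            if i in chosen:
                continue
            marginal = gain(chosen | {i}) - gain(chosen)
            ratio = marginal / (stay[i] + travel[x][i] + travel[i][y])
            # Cheapest position: insert i between route[k] and route[k + 1].
            extra, pos = min(
                (travel[route[k]][i] + travel[i][route[k + 1]]
                 - travel[route[k]][route[k + 1]], k + 1)
                for k in range(len(route) - 1)
            )
            new_cost = cost + extra + stay[i]
            if new_cost <= budget and (best is None or ratio > best[0]):
                best = (ratio, i, pos, new_cost)
        if best is None:
            return route, cost  # budget used up: no remaining POI fits
        _, i, pos, cost = best
        route.insert(pos, i)
        chosen = chosen | {i}
```

Each round scans at most n unvisited POIs and there are at most p rounds, which matches the O(pn) count of candidate insertions stated above.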

5.6 Experimental Evaluation

All algorithms were implemented in C++ and were run on Ubuntu 16.04.1 LTS with an Intel i7-3770 CPU @ 3.40 GHz and 16 GB of RAM.

5.6.1 Experimental Setup

Datasets

We use two real-world datasets from [120]. Singapore denotes the Foursquare check-in data col-

lected in Singapore, and Austin denotes the Gowalla check-in data collected in Austin. Singapore

has 189,306 check-ins at 5,412 locations by 2,321 users, and Austin has 201,525 check-ins at 6,176

locations by 4,630 users. As suggested in [12, 120], we built an edge between two locations if they were visited on the same date by the same user. The locations not connected by edges were ignored. We filled in the edge costs $t_{i,j}$ by querying the traveling time in minutes using the Google Maps API¹ under driving mode. The staying times $m_i$ were generated following the Gaussian distribution, $m_i \sim \mathcal{N}(\mu, \sigma^2)$, with µ = 90 minutes and σ = 15. The features are extracted based on the keywords mentioned by users at check-ins, as in [120]. We obtain the rating of a feature h on POI i by

$$F_{i,h} = \min\left( \frac{NC_h(i)}{\frac{1}{|S_h|}\sum_{j \in S_h} NC_h(j)} \times \frac{\beta}{2},\ \beta \right), \qquad (5.14)$$

where $NC_h(i)$ is the number of check-ins at POI i containing the feature h, $S_h$ is the set of POIs containing h, and β is the maximum feature rating, which is set to β = 5 for both datasets. The calculation scales the middle value β/2 by the ratio of a POI's check-in count to the average check-in count on h. We emphasize that, although we need to choose a specific way to derive the feature rating $F_{i,h}$, our algorithm is orthogonal to how $F_{i,h}$ is derived. Table 5.2 shows the descriptive statistics of the datasets after the above preprocessing.
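As a concrete reading of Eqn. (5.14), the following Python sketch derives the ratings from per-feature check-in counts. The function name `feature_ratings` and the nested-dictionary input layout are assumptions made for this illustration; only the formula itself comes from the text.

```python
def feature_ratings(checkins, beta=5.0):
    """Compute F_{i,h} as in Eqn. (5.14).

    checkins[h][i] is NC_h(i), the number of check-ins at POI i that contain
    feature h; the POIs listed under h form S_h. Returns {(i, h): rating}.
    """
    ratings = {}
    for h, counts in checkins.items():
        # Average check-in count on feature h over the POIs in S_h.
        average = sum(counts.values()) / len(counts)
        for i, nc in counts.items():
            # Scale the middle value beta/2 by the ratio to the average,
            # then cap the result at the maximum rating beta.
            ratings[(i, h)] = min(nc / average * beta / 2, beta)
    return ratings
```

For instance, with check-in counts 10, 20 and 30 on one feature, the average is 20 and the ratings are 1.25, 2.5 and 3.75; a POI with at least twice the average count reaches the cap β = 5.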

Table 5.2: Dataset statistics

Dataset      # POI    # Edges   Average ti,j     # Features
Singapore    1,625    24,969    16.24 minutes    202
Austin       2,609    34,340    11.12 minutes    252

Note that both datasets were previously used in [120], which also studied a route planning problem, and the size of the datasets is not small considering the scenario of a daily trip in a city where the user has a limited cost budget. Even with a very small number of POIs, say 150, to choose from, the number of possible routes consisting of 5 POIs can reach 70 billion. In comparison, [65] evaluated its itinerary recommendation methods using theme park data, where each park in fact contains only 20 to 30 attractions.

Algorithms

We compared the following algorithms. BF is the brute-force method (Section 5.4.5). PACER+1 is our proposed optimal algorithm with only Pruning-1 enabled. PACER+2 enables both Pruning-1 and Pruning-2. PACER-SC is the state collapse algorithm and GR is the greedy algorithm in

¹ https://developers.google.com/maps/documentation/distance-matrix/

Section 5.5. AP is the approximation algorithm proposed by [17] (see Section 5.4.5). A* is the A*

algorithm proposed by [120]. Since A* works only for its specific keyword coverage function, it is

not compared until Section 5.6.3 where we adapt their coverage function in our method. To be fair,

all algorithms use the indices in Section 3.4.1 to speed up. Note that BF, PACER+1, PACER+2 and

A* are exact algorithms, while PACER-SC, GR, and AP are greedy or approximation algorithms.

Queries

A query Q has the six parameters x, y, b, w, θ, Φ. For concreteness, we choose Φh in Eqn. (5.3) with

α controlling the diversity of POIs on a desired route. We assume θh and αh are the same for all

features h. For Singapore, we set x as Singapore Zoo and y as Nanyang Technological University;

and for Austin, we set x as UT Austin and y as Four Seasons Hotel Austin.

For each dataset, we generated 50 weight vectors w to model the feature preferences of 50 users as follows. For each w, we draw l features, where l is a random integer in [1, 4], and the probability of selecting each feature h is

$$\Pr(h) = \frac{\sum_{i \in S_h} NC_h(i)}{\sum_{h' \in H}\sum_{i \in S_{h'}} NC_{h'}(i)}.$$

$NC_h(i)$ and $S_h$ are defined in Eqn. (5.14). Let $H_Q$ be the set of selected features. For each $h \in H_Q$,

$$w_h = \frac{\sum_{i \in S_h} NC_h(i)}{\sum_{h' \in H_Q}\sum_{i \in S_{h'}} NC_{h'}(i)}.$$

Finally, we consider b ∈ {4, 5, 6, 7, 8, 9} in hours, θ ∈ {0, 1.25, 2.5, 3.75}, and α ∈ {0, 0.5, 1, 2} with the default settings in bold face. For each setting of b, θ, α, we generated 50 queries Q = (x, y, b, w, θ, α) using the 50 vectors w above. All costs are in minutes, therefore, b = 5 specifies the budget of 300 minutes.
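The generation of one weight vector w, as described above, might look as follows in Python. This is a hedged sketch: the function name `sample_weight_vector` and the input `feature_totals` (the total check-in count per feature, i.e., the sum of NC_h(i) over S_h) are hypothetical, and drawing the l features without replacement is our assumption, since the text does not state how repeated draws are handled.

```python
import random


def sample_weight_vector(feature_totals, rng):
    """Draw l in [1, 4] features with probability proportional to their
    total check-in counts, then normalize the counts of the selected
    features into weights w_h that sum to 1.

    feature_totals[h] is the sum of NC_h(i) over all POIs i in S_h.
    rng is a random.Random instance (assumed, for reproducibility).
    """
    l = rng.randint(1, 4)
    remaining = dict(feature_totals)
    selected = []
    for _ in range(min(l, len(remaining))):
        names = list(remaining)
        weights = [remaining[name] for name in names]
        h = rng.choices(names, weights=weights, k=1)[0]
        selected.append(h)
        del remaining[h]  # assumption: each feature is drawn at most once
    total = sum(feature_totals[h] for h in selected)
    return {h: feature_totals[h] / total for h in selected}
```

Calling the function 50 times with a seeded `random.Random` yields 50 reproducible vectors, which can then be combined with each setting of (b, θ, α) to form the queries.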

We first evaluate the performance of our proposed algorithms (Section 5.6.2), then we compare

with the A* algorithm (Section 5.6.3).

5.6.2 Performance Study

Evaluation metrics. As we solve an optimization problem, we evaluate Gain for effectiveness,

CPU runtime and search space (in the number of examined open routes) for efficiency.

For every algorithm, we evaluate the three metrics for processing a query, and report the average for the 50 queries (i.e., vectors w) under each setting of (b, θ, α) chosen from the above ranges. GR and AP only find a single route, thus, we first set k = 1 to compare all algorithms, and discuss the impact of larger k at the end of this section.

Figures 5.5 and 5.6 report the experiments for Singapore and Austin, respectively. Each row corresponds to various settings of one of b, θ, α while fixing the other two at the default settings. OPTIMAL denotes the same optimal gain of PACER+2, PACER+1 and BF. We terminated an algorithm for a given query after it had run for 1 hour or had run out of memory, and used the label beside a data point to indicate the fraction of finished queries. If more than half of the queries were terminated, no data point is shown.

[Figure 5.5 (plots omitted). Legends: BF, PACER+1, PACER+2, PACER-SC, GR, AP; and OPTIMAL, PACER-SC, GR, AP. Panels: (a) Runtime (sec) vs. b; (b) # of routes vs. b; (c) Gain vs. b; (d) Runtime (sec) vs. θ; (e) # of routes vs. θ; (f) Gain vs. θ; (g) Runtime (sec) vs. α; (h) # of routes vs. α; (i) Gain vs. α; (j) Runtime (sec) vs. k; (k) # of routes vs. k; (l) Gain vs. k. The time budget b is given in hours, from 4 to 9. Data-point labels 31/50 and 42/50 appear in panels (a) and (b).]

Figure 5.5: Experimental results for Singapore. Run time and search space (# of routes) are in logarithmic scale. The labels beside data points indicate the ratio of queries successfully answered by the algorithm under the parameter setting. No label is shown if no query fails. A data point or bar is not drawn if more than half of the queries fail. AP can only answer queries with small b. GR and AP can only find top-1.

[Figure 5.6 (plots omitted). Legends: BF, PACER+1, PACER+2, PACER-SC, GR, AP; and OPTIMAL, PACER-SC, GR, AP. Panels: (a) Runtime (sec) vs. b; (b) # of routes vs. b; (c) Gain vs. b; (d) Runtime (sec) vs. θ; (e) # of routes vs. θ; (f) Gain vs. θ; (g) Runtime (sec) vs. α; (h) # of routes vs. α; (i) Gain vs. α; (j) Runtime (sec) vs. k; (k) # of routes vs. k; (l) Gain vs. k. The time budget b is given in hours, from 4 to 9. Data-point labels 37/50 and 40/50 appear in panels (a) and (b), and 46/50 in panels (d) and (e).]

Figure 5.6: Experimental results for Austin

Impact of budget b

Figures 5.5a - 5.5c and 5.6a - 5.6c show the results as b varies. b affects the length of routes (the

number of POIs included). As b increases, almost all the algorithms become slower and their search

space increases.

AP is the worst. This is consistent with the analysis in Section 5.4.5 that AP suffers from a high complexity when b and OPT have many discrete values. For b = 6, which has 360 discrete values in minutes, a majority of the queries cannot finish. The efficiency of BF drops dramatically as b increases, since the number of open routes becomes much larger and processing them is both time and memory consuming.

PACER+1’s search space is two orders of magnitude smaller than that of BF, thanks to the

compact state enumeration and the cost dominance pruning. PACER+2 is the best among all the

exact algorithms. Compared with PACER+1, the one order of magnitude speedup in runtime and two

orders of magnitude reduction in search space clearly demonstrates the additional pruning power of

the Gain based upper bound pruning. PACER-SC trades optimality for efficiency. Surprisingly, as

shown in Figure 5.5c and 5.6c, PACER-SC performs quite well with Gain being close to that of

OPTIMAL.

GR always finishes in less than $10^{-2}$ seconds. For Singapore, the achieved gain is far worse than that of OPTIMAL, compared with the difference for Austin. This is because x and y for Singapore are relatively remote from the central city. GR greedily selects a POI i not too far away from x and y (Eqn. (5.13)), thus, many POIs with possibly higher feature ratings located in the central city are less likely to be chosen. In contrast, x and y for Austin are in the downtown area and this situation is avoided in most cases.

Impact of filtering threshold θ

In Figures 5.5d - 5.5f and 5.6d - 5.6f, there is no feature cut-off when θ = 0. As θ becomes larger,

the size of the POI candidate set is reduced and all the algorithms run faster. The majority of the

experiments for AP cannot finish and its results are not shown. The study suggests that a reasonable

value of θ, e.g., 2.5, reduces the searching cost greatly while having little loss on the quality of the

found routes.

Impact of diversity parameter α

Figures 5.5g - 5.5i and 5.6g - 5.6i show the results for various settings of α, which represents the user's route diversity requirement. PACER+2 and PACER-SC are slightly affected when α varies. As α increases, the marginal return diminishes faster and Φh behaves more like the max aggregation. In this case, Pruning-2 becomes less effective. When α = 0, Eqn. (5.3) becomes the sum aggregation and both Gain and the difference between OPTIMAL and GR reach the maximum.

Figure 5.7 illustrates the effectiveness of our power law function in Eqn. (5.3) for modeling the personalized route diversity requirement. We run two queries on Singapore, one with α = 0.5, which specifies a diversity requirement, and one with α = 0, which specifies the usual sum aggregation. The other query parameters are the same. The figures show the best routes found for each query, with the POIs on a route labeled sequentially as A, B · · · . The red dots represent the source x and destination y. The route for α = 0.5 covers all specified features, i.e., two POIs for each feature, while maximizing the total Gain. In contrast, the route for α = 0 has four parks out of five POIs due to the higher weight of Park in w; thus, it is less preferred by a user who values diversity.

[Figure 5.7 (route maps omitted). Routes recoverable from the figure, each starting and ending at Hilton Singapore:]

(a) α = 0.5 (with diversity requirement): Hilton Singapore → A (M:5.0): National Museum → B (P:4.1): Fort Canning Park → C (M:5.0): Singapore Art Museum → D (P:3.1): Esplanade Park → E (R:3.6): Peach Garden → F (R:5.0): Tiong Shian Eating House → Hilton Singapore; Gain = 7.34115; Cost = 540 mins

α = 0: Hilton Singapore → A (M:5.0): National Museum → B (P:4.1): Fort Canning Park → C (P:4.0): Bedok Reservoir Park → D (P:5.0): Tampines Eco Green → E (P:4.8): Bukit Timah Nature Reserve → Hilton Singapore; Gain = 8.66; Cost = 535 mins

Hilton Singapore

A: National Museum

B: Fort Canning Park

D: Esplanade Park

F: Tiong Shian Eating House

C: Singapore Art Museum

E: Peach Garden

(a) PACER+2 top-1

Hilton Singapore →A (M:5.0): National Museum →B (P:4.1): Fort Canning Park →C (M:5.0): Singapore Art Museum →D (P:3.1): Esplanade Park →E (R:3.6): Peach Garden →F (R:5.0): Tiong Shian Eating House →Hilton Singapore;Gain = 7.34115; Cost = 540 mins

(b) Top-2 results by PACER+2

Hilton SingaporeA: National Museum

B: Fort Canning Park

E: Bukit Timah Nature Reserve

D: Tampines Eco Green

C: Bedok Reservoir Park

(c) PACER-SC Top-1

Hilton Singapore →A (M:5.0): National Museum →B (P:4.1): Fort Canning Park →C (P:4.0): Bedok Reservoir Park →D (P:5.0): Tampines Eco Green →E (P:4.8): Bukit Timah Nature Reserve →Hilton SingaporeGain = 8.66; Cost = 535 mins

(d) PACER-SC Top-2

Figure 10: Case study

both the more general gain-based upper bound pruning and the costdominance pruning play important roles. The complexity analysisin Section 5.4 also shows that the computation cost of our algorithmis much less than brute-force on theoretical level.

7.4 Case StudyWe show a real world case study as in Figure 10 based on the

experiments on SG data. The query can be found in the caption ofthe figure. We compare the top trips returned by PACER+2 underdifferent diversity requirements by setting α = 0.5 and α = 0respectively. Due to space limitation, we only show the top-1 tripfor each. We visualize them in the left-side figures, where the reddots represent the source and destination and it visits the POIs withlabels A, B · · · as the alphabetical order. The right-side figuresshow the detailed trip sequences with feature ratings on POIs, aswell as the Gain and Cost of the trips.

As is shown, when α is set as 0.5, the top-1 trip covers two parks,two museums and two Chinese restaurants. That is the algorithm isenable to find diverse trips while maximizing the totalGain. Whendiversity is not enforced by setting α = 0, the top-1 trip mainlyfocus on the feature “Park” (4 out of 5) due to its higher weightby the user, thus, lose the diversity and “Chinese Restaurant” istotally not covered. For the trip in Figure ??, if we set α = 0.5 andrecompute its Gain, it is only 6.6045, which is smaller than that ofthe trip in Figure ??.

8. CONCLUSIONS

9. REFERENCES[1] T. Back, D. B. Fogel, and Z. Michalewicz. Handbook of

evolutionary computation. New York: Oxford, 1997.[2] S. Basu Roy, G. Das, S. Amer-Yahia, and C. Yu. Interactive

itinerary planning. In ICDE, pages 15–26. IEEE, 2011.[3] D. M. Burton. Elementary number theory. Tata McGraw-Hill

Education, 2006.[4] X. Cao, L. Chen, G. Cong, and X. Xiao. Keyword-aware

optimal route search. Proceedings of the VLDB Endowment,5(11):1136–1147, 2012.

[5] X. Cao, G. Cong, C. S. Jensen, and B. C. Ooi. Collectivespatial keyword querying. In Proceedings of the 2011 ACMSIGMOD, pages 373–384. ACM, 2011.

(a) α = 0.5 (with diversity requirement)

4 5 6 7 8 9Time budget (hour)

10-1

101

103

Run

time

(sec

) PACER+2A*

(a) Run time for SG

4 5 6 7 8 9

Time budget (hour)

103

105

107

109

# o

f tr

ips

PACER+2

A*

(b) Search space for SG

4 5 6 7 8 9Time budget (hour)

10-1

101

103

Run

time

(sec

) PACER+2A*

39/50

(c) Run time for AS

4 5 6 7 8 9

Time budget (hour)

103

105

107

109

# o

f tr

ips

PACER+2

A*

39/50

(d) Search space for AS

Figure 9: PACER+2 vs A*, shown in logarithmic scale.

simple sum aggregation. In this case, the diminishing return effectdisappears, both the absolute value of Gain and the difference ofGain between OPTIMAL and GR are the largest. As α increases,both of them decrease.

7.3 Comparison with A*As stated in Section 7.1.2, the A* algorithm is only workable

under their keyword coverage function, namely replace Eqn. (2)with the following:

fh(PV ) = 1−∏

i∈PV

[1− Fh(i)] (21)

In the original paper [28], Fh(i) is required to be in the range [0, 1]and it is directly set to 1 if the number of check-ins on POI i forfeature h is above average. In this case single POI easily reachesthe full coverage of h and do not need to explore other POIs havingh any more. This we believe is the main reason that the run time issmall in their experiments. Besides, the budget in [28] is measuredin distance and the maximum budget is 15 kilometers in their effi-ciency study, which is around 20 minutes by Google Maps underdriving mode and just captures a small area of a city.

To thoroughly measure the scalablility of the algorithms, we fol-low Eqn. (18) to aggregate feature ratings but set β = 0.5, and alsouse time budget and feed A* with the same POI maps. A* in [28]blindly enumerates all nodes that are adjacent to the current node,regardless whether the node has the required features. Besides,the pair-wise shorted traveling cost is computed in pre-processingphase. To speed up their method and to be fair when comparingwith our method, we modify their algorithm to also equip our in-dexing methods, then the only differences are the search and prun-ing strategies.

Figure 9 shows the comparison between PACER+2 and the mod-ified A* on both datasets. Since they both find optimal trips withsame Gain, we only need to compare their efficiency. Obviously,PACER+2 outperforms A*, especially when b is large. Severalqueries of A* on AS data even fail when b = 9 hours. Although A*is embedded with carefully designed pruning strategy specificallyfor Eqn. (21), the search strategy itself is a bottleneck. As is known,the embedded heuristic function makes A* to be a combination ofdepth-first search (DFS) and breath-first search (BFS), but essen-tially A* is still an exhaustive algorithm, thus its time complexityshould be at the same order of DFS or BFS. The experiments in[28] also show that A* is just 2-3 times faster than the brute-forcealgorithm. As for PACER+2, it implements the idea of dynamicprogramming to reuse intermediate computation results, besides,

Hilton Singapore

A: National Museum

B: Fort Canning Park

D: Esplanade Park

F: Tiong Shian Eating House

C: Singapore Art Museum

E: Peach Garden

(a) PACER+2 top-1

Hilton Singapore →A (M:5.0): National Museum →B (P:4.1): Fort Canning Park →C (M:5.0): Singapore Art Museum →D (P:3.1): Esplanade Park →E (R:3.6): Peach Garden →F (R:5.0): Tiong Shian Eating House →Hilton Singapore;Gain = 7.34115; Cost = 540 mins

(b) Top-2 results by PACER+2

Hilton SingaporeA: National Museum

B: Fort Canning Park

E: Bukit Timah Nature Reserve

D: Tampines Eco Green

C: Bedok Reservoir Park

(c) PACER-SC Top-1

Hilton Singapore →A (M:5.0): National Museum →B (P:4.1): Fort Canning Park →C (P:4.0): Bedok Reservoir Park →D (P:5.0): Tampines Eco Green →E (P:4.8): Bukit Timah Nature Reserve →Hilton SingaporeGain = 8.66; Cost = 535 mins

(d) PACER-SC Top-2

Figure 10: Case study

both the more general gain-based upper bound pruning and the costdominance pruning play important roles. The complexity analysisin Section 5.4 also shows that the computation cost of our algorithmis much less than brute-force on theoretical level.

7.4 Case StudyWe show a real world case study as in Figure 10 based on the

experiments on SG data. The query can be found in the caption ofthe figure. We compare the top trips returned by PACER+2 underdifferent diversity requirements by setting α = 0.5 and α = 0respectively. Due to space limitation, we only show the top-1 tripfor each. We visualize them in the left-side figures, where the reddots represent the source and destination and it visits the POIs withlabels A, B · · · as the alphabetical order. The right-side figuresshow the detailed trip sequences with feature ratings on POIs, aswell as the Gain and Cost of the trips.

As is shown, when α is set as 0.5, the top-1 trip covers two parks,two museums and two Chinese restaurants. That is the algorithm isenable to find diverse trips while maximizing the totalGain. Whendiversity is not enforced by setting α = 0, the top-1 trip mainlyfocus on the feature “Park” (4 out of 5) due to its higher weightby the user, thus, lose the diversity and “Chinese Restaurant” istotally not covered. For the trip in Figure ??, if we set α = 0.5 andrecompute its Gain, it is only 6.6045, which is smaller than that ofthe trip in Figure ??.

8. CONCLUSIONS

9. REFERENCES[1] T. Back, D. B. Fogel, and Z. Michalewicz. Handbook of

evolutionary computation. New York: Oxford, 1997.[2] S. Basu Roy, G. Das, S. Amer-Yahia, and C. Yu. Interactive

itinerary planning. In ICDE, pages 15–26. IEEE, 2011.[3] D. M. Burton. Elementary number theory. Tata McGraw-Hill

Education, 2006.[4] X. Cao, L. Chen, G. Cong, and X. Xiao. Keyword-aware

optimal route search. Proceedings of the VLDB Endowment,5(11):1136–1147, 2012.

[5] X. Cao, G. Cong, C. S. Jensen, and B. C. Ooi. Collectivespatial keyword querying. In Proceedings of the 2011 ACMSIGMOD, pages 373–384. ACM, 2011.

(b) α = 0 (without diversity requirement)

Figure 11: Two trips found from the SG data by PACER+2 for thequery Q = (x, y, b = 9,w = (P : 0.4,M : 0.3, R : 0.3), θ =2.5, α), where x and y are Hilton Singapore, and P, M and R rep-resent Park, Museum, and Chinese Restaurant. All features with arating less than θ are not shown.

thanks to the higher weight of Park in w, while maximizing theGain. Though theGain value of the second trip is higher than thatof the first trip, it is less preferred by a user who values the diversity.In fact, to the user with the diversity requirement specified by α =0.5, the second trip’s Gain value is only 6.6045, less than that ofthe first trip. This study clearly shows the usefulness of modelingthe user’s diversity requirement on a trip.

8. CONCLUSION AND EXTENSIONThis work considered a personalized top-k trip search problem

with several practical settings: a POI map containing geograph-ically located POIs described by features, user preferences spec-ified by a query through feature weighting, the budget constraint,and the diversity requirement on POIs. The large scale of POI mapsand the combination of search in feature space, spatial space, andpath space make finding top-k trips computationally hard. The per-sonalized diversity requirement further demands a general searchalgorithm that works for any reasonable diversity specification. Wepresented an exact solution addressing these challenges by multiple

(b) α = 0 (without diversity requirement)

Figure 5.7: Two routes found from Singapore by PACER+2 for the query Q = (x, y, b = 9, w = (P : 0.4, M : 0.3, R : 0.3), θ = 2.5, α), where x and y are Hilton Singapore, and P, M, and R represent Park, Museum, and Chinese Restaurant, respectively.

In fact, the second route’s Gain value when evaluated using α = 0.5 is only 6.60. By modeling the

diversity requirement, our method correctly returns the preferred first route.
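The effect of α on the two routes in Figure 5.7 can be checked with a small calculation. The sketch below assumes that, for each feature, the ratings of the visited POIs are sorted in descending order and the j-th rating is discounted by j^(-α) before the feature weights w are applied. This discount form is an assumption made for illustration and is not quoted from the thesis equations, although it matches the Gain values reported for the two routes. The function name `route_gain` and the dictionary layout are hypothetical.

```python
def route_gain(ratings_by_feature, weights, alpha):
    """Weighted sum of rank-discounted feature ratings (assumed form)."""
    total = 0.0
    for feature, ratings in ratings_by_feature.items():
        ranked = sorted(ratings, reverse=True)
        # The j-th best rating of a recurrent feature is discounted by j^(-alpha).
        discounted = sum(r * j ** (-alpha) for j, r in enumerate(ranked, start=1))
        total += weights[feature] * discounted
    return total


# Feature weights and POI ratings as listed for the two routes in Figure 5.7.
weights = {"P": 0.4, "M": 0.3, "R": 0.3}
first_route = {"M": [5.0, 5.0], "P": [4.1, 3.1], "R": [3.6, 5.0]}
second_route = {"M": [5.0], "P": [4.1, 4.0, 5.0, 4.8]}

for alpha in (0.5, 0):
    print(alpha,
          round(route_gain(first_route, weights, alpha), 4),
          round(route_gain(second_route, weights, alpha), 4))
```

Under this assumption, the second route scores higher when α = 0 (plain weighted sum), while the first route scores higher when α = 0.5, which is the behavior described in the text.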

Impact of k

Figures 5.5j - 5.5l and 5.6j - 5.6l show the performance of all the algorithms, except GR and AP, which can find only the top-1 route, under various settings of k while fixing b, θ, and α at their default values. As k only influences the gain-based pruning, the performance of BF and PACER+1 is unchanged. For PACER+2 and PACER-SC, the change is limited: when k is small, the Gain of the k-th best route is usually not far from that of the best route, so the marginal gain upper bound pruning is not seriously affected. Another interesting result, shown in Figure 5.6l, is that as k increases, the Gain of the heuristic PACER-SC gradually approaches the optimal. This is because, while PACER-SC does not guarantee to maintain the best k routes found so far, a larger k makes the marginal upper bound pruning less aggressive, so that many open routes get the opportunity to “grow longer” and achieve a higher gain instead of being falsely pruned.

Discussion. As shown, our exact algorithm PACER runs much faster in practice than the state-of-the-art approximation algorithm that comes with a theoretical approximation guarantee. Therefore, in real applications, when the budget b and |VQ| are not large, we prefer PACER, which returns the best solutions within a reasonable response time. When b and |VQ| are relatively large, we switch to the collapse heuristic version of PACER, which sacrifices a little optimality but runs much faster. When b and |VQ| are very large, we switch to the fastest greedy algorithm, which can usually return an acceptable solution.

[Figure 5.8: PACER+2 vs. A* (logarithmic scale). Panels: (a) Runtime - Singapore; (b) # of routes - Singapore; (c) Runtime - Austin; (d) # of routes - Austin. x-axis: time budget b (hour), from 4 to 9; y-axes: runtime (sec) and # of routes. The Austin panels carry the annotation 39/50.]
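The selection rule in the discussion above can be sketched as a small dispatcher. The thresholds below are hypothetical placeholders, since the text only says "not large", "relatively large", and "very large"; the returned names are labels rather than calls into an actual implementation, and PACER-SC is assumed here to be the collapse heuristic version of PACER.

```python
def choose_algorithm(budget_hours, num_candidate_pois,
                     exact_limit=(6, 200), heuristic_limit=(9, 1000)):
    """Pick a solver label from the budget b and the candidate-set size |VQ|.

    exact_limit and heuristic_limit are (max budget, max |VQ|) pairs.
    Both are illustrative assumptions and should be tuned per deployment.
    """
    exact_b, exact_n = exact_limit
    heur_b, heur_n = heuristic_limit
    if budget_hours <= exact_b and num_candidate_pois <= exact_n:
        return "PACER"      # exact search, best solutions
    if budget_hours <= heur_b and num_candidate_pois <= heur_n:
        return "PACER-SC"   # heuristic, slightly sub-optimal but faster
    return "greedy"         # fastest, usually acceptable quality


print(choose_algorithm(5, 100))
print(choose_algorithm(8, 500))
print(choose_algorithm(12, 5000))
```

Keeping the thresholds as parameters reflects that the boundary between the three regimes depends on the dataset and the acceptable response time.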

5.6.3 Comparison with A*

A* [120] only works for their keyword coverage function, $\Phi_h(P_V) = 1 - \prod_{i \in P_V} [1 - F_{i,h}]$, and finds only a single route. In [120], $F_{i,h}$ is in the range [0, 1] and is set to 1 if the number of check-ins on POI i for feature h is above average. In this case, a single such POI in P already yields the maximum $\Phi_h(P_V)$ value, and feature h of the other POIs is ignored. For a fair comparison, we set β = 0.5 in Eqn. (5.14) for both algorithms and also leverage our indices to speed up A*. Note that the maximum b in [120] is 15 kilometers in their efficiency study, which is about 20 minutes by Google Maps under driving mode.
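The saturation effect of the keyword coverage function can be seen in a few lines. This is a minimal sketch of the formula as stated above; the function name `keyword_coverage` and the list-of-ratings input are choices made for illustration.

```python
from math import prod


def keyword_coverage(ratings):
    """Phi_h(P_V) = 1 - prod over i in P_V of (1 - F_{i,h}), with F_{i,h} in [0, 1]."""
    if any(not 0.0 <= f <= 1.0 for f in ratings):
        raise ValueError("each F_{i,h} must lie in [0, 1]")
    return 1.0 - prod(1.0 - f for f in ratings)


# Two moderately rated POIs cover the feature only partially.
print(keyword_coverage([0.5, 0.5]))
# One POI with F = 1 saturates the coverage; further POIs add nothing.
print(keyword_coverage([1.0]), keyword_coverage([1.0, 0.3, 0.9]))
```

Once any factor (1 - F) is zero the product vanishes, so the coverage stays at 1 regardless of the remaining POIs.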

Figure 5.8 shows the comparison between PACER+2 and the modified A* on both datasets. The Gain is not reported, as both are exact algorithms. Clearly, PACER+2 outperforms A*, especially for a large b, where PACER+2 is two orders of magnitude faster than A*. Several queries of A* on Austin even failed for b = 9 hours. Although A* has a pruning strategy designed specifically for their keyword coverage function, the search strategy itself is a bottleneck. Besides, their pruning, based on the greedy algorithm in [46], has a looser bound than ours. In fact, the experiments in [120] showed that A* is just 2-3 times faster than the brute-force algorithm.


5.7 Summary and Extensions

We considered a personalized top-k route search problem in this chapter. The large scale of POI maps and the combination of search in feature space and path space make this problem computationally hard. The personalized route diversity requirement further demands a solution that works for any reasonable route diversity specification. We presented an exact search algorithm with multiple pruning strategies to address these challenges, as well as high-performance heuristic solutions. The experiments suggested that our solutions significantly outperform the state-of-the-art algorithms.

We outline several possible extensions of this work.

Feature order constraints. Some users may have order preferences on the features in a route, e.g., at least one POI with feature “Food” should be visited before the POIs with “Shopping”. All such partial orders on features can be represented by a topological sorting. Once an open route violates an order constraint, the violation cannot be removed by appending more POIs to its end; thus, it is safe to prune all routes that extend the open route. That is, the order constraints are anti-monotone. Our algorithms presented in Sections 5.4 and 5.5 can be easily adapted to prune the search space for any additional anti-monotone constraint.
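The anti-monotone property of order constraints can be illustrated with a small check on open routes. The sketch below assumes a route is given as a list of feature sets in visiting order and a constraint as a hypothetical pair (first, later), meaning that some POI with `first` must be visited before any POI with `later`; a POI carrying both features is treated as a violation here, which is a modeling assumption.

```python
def violates_order(route, constraints):
    """Return True if the open route breaks any (first, later) feature order."""
    seen = set()
    for features in route:
        for first, later in constraints:
            # A POI with `later` arrives before any POI with `first` was visited.
            if later in features and first not in seen:
                return True
        seen.update(features)
    return False


constraints = [("Food", "Shopping")]
ok_route = [{"Park"}, {"Food"}, {"Shopping"}]
bad_route = [{"Shopping"}]

print(violates_order(ok_route, constraints))
print(violates_order(bad_route, constraints))
# Appending more POIs cannot repair the violation, so the route can be pruned.
print(violates_order(bad_route + [{"Food"}], constraints))
```

Because the check only looks at the prefix visited so far, a violating open route and all of its extensions can be discarded as soon as the check fires.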

Feature combination requirements. The following feature combination requirements, among others, may be interesting in real applications: i. two POIs having the same feature A cannot be visited consecutively; ii. either both features A and B are visited, or neither is visited; iii. exactly one of A and B is visited, but not both. With these constraints, fewer routes qualify, and thus extra pruning is introduced.
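The three requirements listed above can be written as simple predicates over a route. This is a sketch under the assumption that a route is a list of feature sets in visiting order and that the checks are applied to a complete route; the function name and the rule labels are hypothetical.

```python
def satisfies_combination(route, a, b, rule):
    """Check one feature combination requirement on a complete route."""
    has_a = any(a in features for features in route)
    has_b = any(b in features for features in route)
    if rule == "no_consecutive":
        # (i) two POIs with feature a must not be adjacent in the route
        return not any(a in x and a in y for x, y in zip(route, route[1:]))
    if rule == "both_or_none":
        # (ii) a and b are either both visited or both absent
        return has_a == has_b
    if rule == "exactly_one":
        # (iii) exactly one of a and b is visited
        return has_a != has_b
    raise ValueError(f"unknown rule: {rule}")


route = [{"A"}, {"B"}, {"A"}]
print(satisfies_combination(route, "A", "B", "no_consecutive"))
print(satisfies_combination(route, "A", "B", "both_or_none"))
print(satisfies_combination(route, "A", "B", "exactly_one"))
```

Requirement (i) can also be tested on open routes, since two adjacent POIs with the same feature remain adjacent however the route is extended.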


Chapter 6

Conclusion

6.1 Summary of the Thesis

User behavioral data are ubiquitous in online social media, and the volume of such data keeps growing at an incredibly fast speed. Studying user behaviors and mining the tremendous knowledge hidden in the massive behavioral data benefits both service providers and users of online social media. While user behaviors in online social media can be categorized into multiple classes, we are specifically interested in the social connectivity/interaction behaviors and the mobility behaviors, which involve rich semantic information on the nodes and edges of social networks and are related to many emerging research topics.

In this thesis, we first summarized the typical topics of exploring social connectivity/interaction behaviors and mobility behaviors, respectively, from a bird’s eye view; we then presented three pieces of work on mining these two kinds of user behaviors and made the following contributions.

• (Chapter 3) We studied the problem of mining strong group relationships in social networks that do not follow the homophily principle, where homophily is the phenomenon that similar or like-minded individuals are more likely to connect with each other. While the literature largely focuses on applications based on the homophily principle, such as community detection and CF-based methods in recommender systems, and the social ties following homophily are usually well expected, our work on mining such interesting non-homophily strong ties initiates a new angle and opens the door to an array of research problems that analyze social networks from a different point of view. Specifically, we proposed a novel ranking metric, non-homophily preference, to identify the strong non-homophily group social ties and developed an efficient algorithm, GRMiner, for discovering the top-k non-homophily group social ties.

• (Chapter 4) We proposed a novel trip recommendation problem that takes into account the user’s personalized preference and multiple real-world constraints to make the recommendation results more practical and interesting. The constraints include the user’s start/destination location, the time budget, the time window for POI availability, the uncertainty of traveling time between POIs, and the minimum number of different POI categories the user may require for touring diversity. Solving trip recommendation as a discrete optimization problem under all these constraints is quite challenging. We developed two efficient exact solutions that guarantee the optimality of the found trips and also presented two heuristic solutions for finding “good trips” with a significantly better runtime than the optimal solutions.

• (Chapter 5) Building on the trip recommendation problem above, we further proposed a more general on-demand route search problem that is aware of personalized diversity requirements on POI features. The motivation is that users usually have too little historical data for learning an accurate personalized preference on POIs, and people’s minds change dynamically over time. We proposed to model the user’s personalized trade-off between quantity and variety by a general class of submodular functions that allow the user to specify how the marginal utility diminishes with each additional visited POI that has a recurrent feature. We designed an elegant optimal algorithm that deals with any submodular objective function and incorporates multiple pruning strategies, especially a tight utility upper bounding strategy, for pruning unpromising routes. We also presented heuristic algorithms that provide answers of competitive quality and work efficiently for a larger POI map and/or a looser constraint.

6.2 Future Directions

Besides the direct extensions of our work in this thesis (presented in the last section of each chapter), the research in this thesis suggests many promising future directions. Several interesting ones are listed below.

Non-homophily Meets Other Applications

In addition to mining unexpected, interesting group social ties by modeling the non-homophily preference, we see the potential to break the boundary of homophily in other applications as well. One application is the recommender system. As summarized earlier, the most popular recommendation algorithms at present are the Collaborative Filtering based approaches, which imply the use of the homophily principle. It is already claimed in [42] that “recommending popular items is unlikely to result in more gain than discovering insignificant yet liked items because the popular ones might be already known to the user”. Some recent works have attempted to make recommendations beyond homophily; for example, [74] infers networks of product relationships to recommend complementary products in addition to substitutes (similar products), and [101] makes social recommendations based on the strong-weak tie theory. It is worth exploring this trend further to obtain insightful and surprisingly interesting recommendation results.

According to [82], heterophilous networks are better at promoting and spreading innovations. Hence, other potential applications of non-homophily include team building suggestions for businesses/start-ups, competitions, or other collaborations, and seed selection in information propagation.


Mining More Complex Social Tie Patterns and Mining from Dynamic Social Networks

In Chapter 3, we mined the top group relationships (GRs) from an entire static social network, and the mined GRs are independent of each other. Many more challenging and useful tasks can be studied. One task is, instead of mining stand-alone top GRs, to directly mine social tie patterns with more complex structures. As discussed in the motivation and experiments of our work, a single top GR may not reveal much insightful information, but it becomes quite interesting when compared with other related GRs; see, for example, the discussion of P5 in Section 3.5.2. Thus, one useful application is to directly mine pair-wise interesting social ties, for example, two ties that have the same LHS but different values for the same attribute on the RHS, and whose difference in relative interestingness (by a certain measure) is bigger than a threshold. In addition to pair-wise patterns, other complex structures can also be interesting.

Our current mining task is conducted on a static social network. However, real-world social networks change dynamically and quickly. The following problems involving time-series analysis can be practically useful. (1) We still consider the problem of mining strong GRs from the entire social network, but instead of recomputing over the entire graph whenever the structure of the social network changes, we process only the changed part at a small cost and accurately obtain the real-time top GR results based on the historical results. (2) Find the significant migration patterns of user interests in making connections or interactions over time.

Neural Network Models for Route Recommendation

(Deep) neural network models have gained great success in the fields of computer vision and natural language processing in recent years. There are also several recent works applying neural network models to recommend the top-N sequential items that a user is likely to interact with in the near future [92]. We see the potential of embedding such models into route recommendation methods. The difference between sequential item recommendation and route recommendation is that the latter also considers many spatial constraints, in addition to the feature representations of items/POIs and the time stamps. The challenges brought by the spatial constraints can be dealt with in different ways. One possible way is to regard route recommendation as a regular successive next-POI prediction/recommendation task, apply the neural network models to make the prediction/recommendation at each step, and check whether the spatial constraints and budget are satisfied. Another way is to borrow the idea of how neural networks play the game of Go [85] to solve the task of route recommendation; in particular, bring in a “policy network” to suggest the move and use a “value network” for pruning. The recently popular Generative Adversarial Network (GAN) also has the potential to be used in recommending interesting personalized travel routes.
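The first idea above, step-wise next-POI prediction followed by a constraint check, can be sketched as a loop. Everything here is an illustrative assumption: `score_next` stands in for a trained neural model, `travel_time` and `stay_time` are hypothetical inputs, and the only constraint checked is that the destination stays reachable within the budget.

```python
def build_route(source, dest, pois, travel_time, stay_time, budget, score_next):
    """Greedily extend a route with the best-scored POI that keeps it feasible."""
    route, used, current = [source], 0.0, source
    remaining = set(pois)
    while remaining:
        # Keep only POIs after which the destination is still reachable in budget.
        feasible = [p for p in remaining
                    if used + travel_time(current, p) + stay_time[p]
                    + travel_time(p, dest) <= budget]
        if not feasible:
            break
        nxt = max(feasible, key=lambda p: score_next(route, p))
        used += travel_time(current, nxt) + stay_time[nxt]
        route.append(nxt)
        remaining.discard(nxt)
        current = nxt
    used += travel_time(current, dest)
    route.append(dest)
    return route, used


# Toy data: POIs on a line, fixed scores in place of a model's predictions.
position = {"x": 0, "a": 1, "b": 2, "c": 10}
scores = {"a": 0.9, "b": 0.5, "c": 0.99}
route, used = build_route(
    "x", "x", ["a", "b", "c"],
    travel_time=lambda u, v: abs(position[u] - position[v]),
    stay_time={"a": 1, "b": 1, "c": 1},
    budget=6,
    score_next=lambda prefix, p: scores[p],
)
print(route, used)
```

In the toy run, the highest-scored POI c is never selected because visiting it would exceed the budget, which is the role the constraint check plays at each step.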


Bibliography

[1] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. In ACM SIGMOD Record, volume 22, pages 207–216. ACM, 1993.

[2] Yuki Arase, Xing Xie, Takahiro Hara, and Shojiro Nishio. Mining people’s trips from large scale geo-tagged photos. In Proceedings of the international conference on Multimedia, pages 133–142. ACM, 2010.

[3] Liliana Ardissono, Anna Goy, Giovanna Petrone, Marino Segnan, and Pietro Torasso. Intrigue: personalized recommendation of tourist attractions for desktop and hand held devices. Applied Artificial Intelligence, 17(8-9):687–714, 2003.

[4] Jie Bao, Yu Zheng, David Wilkie, and Mohamed Mokbel. Recommendations in location-based social networks: a survey. Geoinformatica, 19(3):525–565, 2015.

[5] Jinling Bao, Xingshan Liu, Rui Zhou, and Bin Wang. Keyword-aware optimal location query in road network. In International Conference on Web-Age Information Management, pages 164–177. Springer, 2016.

[6] Senjuti Basu Roy, Gautam Das, Sihem Amer-Yahia, and Cong Yu. Interactive itinerary planning. In ICDE, pages 15–26. IEEE, 2011.

[7] Roberto J Bayardo Jr and Rakesh Agrawal. Mining the most interesting rules. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 145–154. ACM, 1999.

[8] Kevin Beyer and Raghu Ramakrishnan. Bottom-up computation of sparse and iceberg cube. In ACM SIGMOD Record, volume 28, pages 359–370. ACM, 1999.

[9] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. the Journal of machine Learning research, 3:993–1022, 2003.

[10] Adi Botea, Evdokia Nikolova, and Michele Berlingerio. Multi-modal journey planning in the presence of uncertainty. In ICAPS, 2013.

[11] David M Burton. Elementary number theory. Tata McGraw-Hill Education, 2006.

[12] Xin Cao, Lisi Chen, Gao Cong, and Xiaokui Xiao. Keyword-aware optimal route search. Proceedings of the VLDB Endowment, 5(11):1136–1147, 2012.

[13] Xin Cao, Gao Cong, and Christian S Jensen. Retrieving top-k prestige-based relevant spatial web objects. Proceedings of the VLDB Endowment, 3(1-2):373–384, 2010.


[14] Xin Cao, Gao Cong, Christian S Jensen, and Beng Chin Ooi. Collective spatial keyword querying. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pages 373–384. ACM, 2011.

[15] Deepayan Chakrabarti and Christos Faloutsos. Graph mining: Laws, generators, and algorithms. ACM Computing Surveys (CSUR), 38(1):2, 2006.

[16] Chandra Chekuri, Nitish Korula, and Martin Pál. Improved algorithms for orienteering and related problems. ACM Transactions on Algorithms (TALG), 8(3):23, 2012.

[17] Chandra Chekuri and Martin Pal. A recursive greedy algorithm for walks in directed graphs. In FOCS, pages 245–253. IEEE, 2005.

[18] Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han, and Philip S Yu. Graph olap: Towards online analytical processing on graphs. In Proceedings of the Eighth IEEE International Conference on Data Mining (ICDM’08), pages 103–112. IEEE, 2008.

[19] Wei Chen, Yajun Wang, and Siyu Yang. Efficient influence maximization in social networks. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 199–208. ACM, 2009.

[20] Chen Cheng, Haiqin Yang, Irwin King, and Michael R Lyu. Fused matrix factorization with geographical and social influence in location-based social networks. In AAAI, volume 12, pages 17–23, 2012.

[21] Chen Cheng, Haiqin Yang, Michael R Lyu, and Irwin King. Where you like to go next: Successive point-of-interest recommendation. In IJCAI, pages 2605–2611. AAAI Press, 2013.

[22] Zhiyuan Cheng, James Caverlee, Krishna Yeswanth Kamath, and Kyumin Lee. Toward traffic-driven location-based web search. In CIKM, pages 805–814, 2011.

[23] Thomas H Cormen. “8.2 Counting Sort”, Introduction to algorithms (2nd ed.). MIT press, 2009.

[24] Justin Cranshaw, Eran Toch, Jason Hong, Aniket Kittur, and Norman Sadeh. Bridging the gap between physical location and online social networks. In Proceedings of the 12th ACM international conference on Ubiquitous computing, pages 119–128. ACM, 2010.

[25] Jian Dai, Bin Yang, Chenjuan Guo, and Zhiming Ding. Personalized route recommendation using big trajectory data. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on, pages 543–554. IEEE, 2015.

[26] Munmun De Choudhury, Moran Feldman, Sihem Amer-Yahia, Nadav Golbandi, Ronny Lempel, and Cong Yu. Automatic construction of travel itineraries using social breadcrumbs. In Proceedings of the 21st ACM conference on Hypertext and hypermedia, pages 35–44. ACM, 2010.

[27] Luc Dehaspe and Hannu Toivonen. Discovery of frequent datalog patterns. Data Mining and knowledge discovery, 3(1):7–36, 1999.

[28] Qiming Diao, Minghui Qiu, Chao-Yuan Wu, Alexander J Smola, Jing Jiang, and Chong Wang. Jointly modeling aspects, ratings and sentiments for movie recommendation (jmars). In the 20th ACM SIGKDD, pages 193–202. ACM, 2014.

[29] Yuxiao Dong, Yang Yang, Jie Tang, Yang Yang, and Nitesh V Chawla. Inferring user demographics and social strategies in mobile social networks. In SIGKDD, 2014.

[30] Robert W Floyd. Algorithm 97: shortest path. Communications of the ACM, 5(6):345, 1962.

[31] Santo Fortunato. Community detection in graphs. Physics reports, 486(3):75–174, 2010.

[32] Sébastien Gambs, Marc-Olivier Killijian, and Miguel Núñez del Prado Cortez. Next place prediction using mobility markov chains. In Proceedings of the First Workshop on Measurement, Privacy, and Mobility, page 3. ACM, 2012.

[33] Yong Ge, Qi Liu, Hui Xiong, Alexander Tuzhilin, and Jian Chen. Cost-aware travel tour recommendation. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 983–991. ACM, 2011.

[34] Aristides Gionis, Theodoros Lappas, Konstantinos Pelechrinis, and Evimaria Terzi. Customized tour recommendations in urban areas. In WSDM, pages 313–322. ACM, 2014.

[35] Bruce L Golden, Larry Levy, and Rakesh Vohra. The orienteering problem. Naval research logistics, 34(3):307–318, 1987.

[36] Joseph E Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. Powergraph: Distributed graph-parallel computation on natural graphs. In OSDI, volume 12, page 2, 2012.

[37] Mark Granovetter. Getting a job: A study of contacts and careers. University of Chicago Press, 1995.

[38] Mark Granovetter. The impact of social structure on economic outcomes. The Journal of economic perspectives, 19(1):33–50, 2005.

[39] Younes Guessous, Maurice Aron, Neila Bhouri, and Simon Cohen. Estimating travel time distribution under different traffic conditions. Transportation Research Procedia, 3:339–348, 2014.

[40] Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. In ACM sigmod record, volume 29, pages 1–12. ACM, 2000.

[41] Bo Hu and Martin Ester. Spatial topic modeling in online social media for location recommendation. In RecSys, pages 25–32, 2013.

[42] Tamas Jambor and Jun Wang. Optimizing multiple objectives in collaborative filtering. In Proceedings of the fourth ACM conference on Recommender systems, pages 55–62. ACM, 2010.

[43] Minhao Jiang, Ada Wai-Chee Fu, Raymond Chi-Wing Wong, and Yanyan Xu. Hop doubling label indexing for point-to-point distance querying on scale-free networks. Proceedings of the VLDB Endowment, 7(12):1203–1214, 2014.

[44] Long Jin, Yang Chen, Tianyi Wang, Pan Hui, and Athanasios V Vasilakos. Understanding user behavior in online social networks: A survey. IEEE Communications Magazine, 51(9):144–150, 2013.

[45] David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 137–146. ACM, 2003.

[46] Samir Khuller, Anna Moss, and Joseph Seffi Naor. The budgeted maximum coverage problem. Information Processing Letters, 70(1):39–45, 1999.

[47] Myunghwan Kim and Jure Leskovec. Modeling social networks with node attributes using the multiplicative attribute graph model. arXiv preprint arXiv:1106.5053, 2011.

[48] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, pages 30–37, 2009.

[49] Andreas Krause and Daniel Golovin. Submodular function maximization. Tractability: Practical Approaches to Hard Problems, 3(19):8, 2012.

[50] Andreas Krause and Carlos Guestrin. Beyond convexity: Submodularity in machine learning. ICML Tutorials, 2008.

[51] Michihiro Kuramochi and George Karypis. Frequent subgraph discovery. In Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM), pages 313–320. IEEE, 2001.

[52] Takeshi Kurashima, Tomoharu Iwata, Takahide Hoshide, Noriko Takaya, and Ko Fujimura. Geo topic model: joint modeling of user’s activity area and interests for location recommendation. In WSDM, pages 375–384, 2013.

[53] Takeshi Kurashima, Tomoharu Iwata, Go Irie, and Ko Fujimura. Travel route recommendation using geotags in photo sharing sites. In CIKM, pages 579–588. ACM, 2010.

[54] Andrea Lancichinetti and Santo Fortunato. Community detection algorithms: a comparative analysis. Physical review E, 80(5):056117, 2009.

[55] Jure Leskovec, Lada A Adamic, and Bernardo A Huberman. The dynamics of viral marketing. ACM Transactions on the Web (TWEB), 1(1):5, 2007.

[56] Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, and Natalie Glance. Cost-effective outbreak detection in networks. In Proceedings of the 13th ACM SIGKDD, pages 420–429. ACM, 2007.

[57] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Mining of massive datasets. Cambridge University Press, 2014.

[58] Cane Wing-ki Leung, Ee-Peng Lim, David Lo, and Jianshu Weng. Mining interesting link formation rules in social networks. In CIKM, 2010.

[59] Kenneth Wai-Ting Leung, Dik Lun Lee, and Wang-Chien Lee. Clr: a collaborative location recommendation framework based on co-clustering. In SIGIR, pages 305–314, 2011.

[60] Jiuyong Li. On optimal rule discovery. IEEE Transactions on Knowledge and Data Engineering, 18(4):460–471, 2006.

[61] Nan Li, Ziyu Guan, Lijie Ren, Jian Wu, Jiawei Han, and Xifeng Yan. giceberg: Towards iceberg analysis in large graphs. In ICDE, 2013.

[62] Wengen Li, Jiannong Cao, Jihong Guan, Man Lung Yiu, and Shuigeng Zhou. Retrieving routes of interest over road networks. In International Conference on Web-Age Information Management, pages 109–123. Springer, 2016.

[63] Hongwei Liang and Ke Wang. Top-k route search through submodularity modeling of recurrent poi features. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 2018.

[64] Hongwei Liang, Ke Wang, and Feida Zhu. Mining social ties beyond homophily. In Data Engineering (ICDE), 2016 IEEE 32nd International Conference on, pages 421–432. IEEE, 2016.

[65] Kwan Hui Lim, Jeffrey Chan, Shanika Karunasekera, and Christopher Leckie. Personalized itinerary recommendation with queuing time awareness. In Proceedings of the 40th ACM SIGIR, pages 325–334. ACM, 2017.

[66] Bin Liu, Yanjie Fu, Zijun Yao, and Hui Xiong. Learning geographical preferences for point-of-interest recommendation. In KDD, pages 1043–1051, 2013.

[67] Junqiang Liu, Ke Wang, and Benjamin CM Fung. Direct discovery of high utility itemsets without candidate generation. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pages 984–989. IEEE, 2012.

[68] Qi Liu, Yong Ge, Zhongmou Li, Enhong Chen, and Hui Xiong. Personalized travel package recommendation. In ICDM, pages 407–416. IEEE, 2011.

[69] Eric Hsueh-Chan Lu, Ching-Yu Chen, and Vincent S Tseng. Personalized trip recommendation with multiple constraints by mining user check-in behaviors. In SIGSPATIAL, pages 209–218, 2012.

[70] Linyuan Lü and Tao Zhou. Link prediction in complex networks: A survey. Physica A: statistical mechanics and its applications, 390(6):1150–1170, 2011.

[71] Ying Lu, Gregor Jossé, Tobias Emrich, Ugur Demiryurek, Matthias Renz, Cyrus Shahabi, and Matthias Schubert. Scenic routes now: Efficiently solving the time-dependent arc orienteering problem. pages 487–496, 2017.

[72] Zephoria Digital marketing. The top 20 valuable facebook statistics - updated february 2018. https://zephoria.com/top-15-valuable-facebook-statistics/, 2018.

[73] NA Marlow. A normal limit theorem for power sums of independent random variables. Bell System Technical Journal, 46(9):2081–2089, 1967.

[74] Julian McAuley, Rahul Pandey, and Jure Leskovec. Inferring networks of substitutable and complementary products. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2015.

[75] Miller McPherson, Lynn Smith-Lovin, and James M Cook. Birds of a feather: Homophily in social networks. Annual review of sociology, pages 415–444, 2001.

[76] Clair E Miller, Albert W Tucker, and Richard A Zemlin. Integer programming formulation of traveling salesman problems. Journal of the ACM (JACM), 7(4):326–329, 1960.

[77] Ron Milo, Shai Shen-Orr, Shalev Itzkovitz, Nadav Kashtan, Dmitri Chklovskii, and Uri Alon. Network motifs: simple building blocks of complex networks. Science, 298(5594):824–827, 2002.

[78] Mark EJ Newman. The structure and function of complex networks. SIAM review, 45(2):167–256, 2003.

[79] Anastasios Noulas, Salvatore Scellato, Neal Lathia, and Cecilia Mascolo. Mining user mobility features for next place prediction in location-based services. In Data mining (ICDM), 2012 IEEE 12th international conference on, pages 1038–1043. IEEE, 2012.

[80] Joseph J Pfeiffer III, Sebastian Moreno, Timothy La Fond, Jennifer Neville, and Brian Gallagher. Attributed graph models: Modeling network structure with correlated attributes. In Proceedings of the 23rd international conference on World wide web, pages 831–842. ACM, 2014.

[81] Meng Qu, Hengshu Zhu, Junming Liu, Guannan Liu, and Hui Xiong. A cost-effective recommender system for taxi drivers. In KDD, pages 45–54, 2014.

[82] Everett M Rogers. Diffusion of innovations. Simon and Schuster, 2010.

[83] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. NIPS, pages 1257–1264, 2008.

[84] Guy Shani and Asela Gunawardana. Evaluating recommendation systems. In Recommender systems handbook, pages 257–297. Springer, 2011.

[85] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.

[86] Amarjeet Singh, Andreas Krause, Carlos Guestrin, William J Kaiser, and Maxim A Batalin. Efficient planning of informative paths for multiple robots. In IJCAI, volume 7, pages 2204–2211, 2007.

[87] Yizhou Sun, Rick Barber, Manish Gupta, Charu C Aggarwal, and Jiawei Han. Co-author relationship prediction in heterogeneous bibliographic networks. In ASONAM, 2011.

[88] Yizhou Sun and Jiawei Han. Mining heterogeneous information networks: principles and methodologies. Synthesis Lectures on Data Mining and Knowledge Discovery, 3(2):1–159, 2012.

[89] Yizhou Sun and Jiawei Han. Mining heterogeneous information networks: a structural analysis approach. ACM SIGKDD Explorations Newsletter, 2013.

[90] Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S Yu, and Tianyi Wu. Pathsim: Meta path-based top-k similarity search in heterogeneous information networks. Proceedings of the VLDB Endowment, 4(11):992–1003, 2011.

[91] Maxim Sviridenko. A note on maximizing a submodular set function subject to a knapsack constraint. Operations Research Letters, 32(1):41–43, 2004.

[92] Jiaxi Tang and Ke Wang. Personalized top-n sequential recommendation via convolutional sequence embedding. 2018.

[93] Wenbin Tang, Honglei Zhuang, and Jie Tang. Learning to infer social ties in large networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 381–397. Springer, 2011.

[94] Yuanyuan Tian, Richard A Hankins, and Jignesh M Patel. Efficient aggregation for graph summarization. In Proceedings of the 2008 ACM SIGMOD international conference on management of data, pages 567–580. ACM, 2008.

[95] Amanda L Traud, Peter J Mucha, and Mason A Porter. Social structure of facebook networks. Physica A: Statistical Mechanics and its Applications, 391(16):4165–4180, 2012.

[96] World Travel and Tourism Council. Travel and tourism global economic impact and issues 2017. https://www.wttc.org/, 2017.

[97] Theodore Tsiligirides. Heuristic methods applied to orienteering. Journal of the Operational Research Society, pages 797–809, 1984.

[98] Pieter Vansteenwegen, Wouter Souffriau, and Dirk Van Oudheusden. The orienteering problem: A survey. European Journal of Operational Research, 209(1):1–10, 2011.

[99] Chong Wang and David M. Blei. Collaborative topic modeling for recommending scientific articles. In KDD, pages 448–456, 2011.

[100] Ke Wang, Yuelong Jiang, and Laks VS Lakshmanan. Mining unexpected rules by pushing user dynamics. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 246–255. ACM, 2003.

[101] Xin Wang, Wei Lu, Martin Ester, Can Wang, and Chun Chen. Social recommendation with strong and weak ties. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages 5–14. ACM, 2016.

[102] Stanley Wasserman and Katherine Faust. Social network analysis: Methods and applications, volume 8. Cambridge university press, 1994.

[103] Geoffrey I Webb and Jilles Vreeken. Efficient discovery of the most interesting associations. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(3):15, 2014.

[104] Ling-Yin Wei, Yu Zheng, and Wen-Chih Peng. Constructing popular routes from uncertain trajectories. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 195–203. ACM, 2012.

[105] Jianshu Weng, Ee-Peng Lim, Jing Jiang, and Qi He. Twitterrank: finding topic-sensitive influential twitterers. In Proceedings of the third ACM international conference on Web search and data mining, pages 261–270. ACM, 2010.

[106] Bradford S Westgate, Dawn B Woodard, David S Matteson, Shane G Henderson, et al. Travel time estimation for ambulances using bayesian data augmentation. The Annals of Applied Statistics, 7(2):1139–1161, 2013.

[107] Yao Wu. Towards Better User Preference Learning for Recommender Systems. PhD thesis, Simon Fraser University, 2016.

[108] Min Xie, Laks VS Lakshmanan, and Peter T Wood. Breaking out of the box of recommendations: from items to packages. In Proceedings of the fourth ACM conference on Recommender systems, pages 151–158. ACM, 2010.

[109] Min Xie, Laks VS Lakshmanan, and Peter T Wood. Comprec-trip: A composite recommendation system for travel planning. In ICDE, pages 1352–1355. IEEE, 2011.

[110] Xiaowei Xu, Nurcan Yuruk, Zhidan Feng, and Thomas AJ Schweiger. Scan: a structural clustering algorithm for networks. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 824–833. ACM, 2007.

[111] Xifeng Yan and Jiawei Han. gspan: Graph-based substructure pattern mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM), pages 721–724. IEEE, 2002.

[112] Jaewon Yang, Julian McAuley, and Jure Leskovec. Community detection in networks with node attributes. In Proceedings of the 13th IEEE international conference on Data Mining (ICDM), pages 1151–1156. IEEE, 2013.

[113] Yu Yang, Xiangbo Mao, Jian Pei, and Xiaofei He. Continuous influence maximization: What discounts should we offer to social network users? In Proceedings of the 2016 International Conference on Management of Data, pages 727–741. ACM, 2016.

[114] Mao Ye, Peifeng Yin, and Wang-Chien Lee. Location recommendation for location-based social networks. In GIS, pages 458–461, 2010.

[115] Hongzhi Yin, Yizhou Sun, Bin Cui, Zhiting Hu, and Ling Chen. Lcars: a location-content-aware recommender system. In KDD, pages 221–229, 2013.

[116] Hyoseok Yoon, Yu Zheng, Xing Xie, and Woontack Woo. Social itinerary recommendation from user-generated digital trails. Personal and Ubiquitous Computing, 16(5):469–484, 2012.

[117] Xiao Yu, Xiang Ren, Yizhou Sun, Quanquan Gu, Bradley Sturt, Urvashi Khandelwal, Brandon Norick, and Jiawei Han. Personalized entity recommendation: A heterogeneous information network approach. In Proceedings of the 7th ACM international conference on Web search and data mining, pages 283–292. ACM, 2014.

[118] Jing Yuan, Yu Zheng, and Xing Xie. Discovering regions of different functions in a city using human mobility and pois. In KDD, pages 186–194, 2012.

[119] Reza Zafarani, Mohammad Ali Abbasi, and Huan Liu. Social media mining: an introduction. Cambridge University Press, 2014.

[120] Yifeng Zeng, Xuefeng Chen, Xin Cao, Shengchao Qin, Marc Cavazza, and Yanping Xiang. Optimal route search with the coverage of users’ preferences. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 2118–2124. AAAI Press, 2015.

[121] Chao Zhang, Jiawei Han, Lidan Shou, Jiajun Lu, and Thomas La Porta. Splitter: Mining fine-grained sequential patterns in semantic trajectories. Proceedings of the VLDB Endowment, 7(9):769–780, 2014.

[122] Chenyi Zhang, Hongwei Liang, and Ke Wang. Trip recommendation meets real-world constraints: Poi availability, diversity, and traveling time uncertainty. ACM Transactions on Information Systems (TOIS), 35(1):5, 2016.

[123] Chenyi Zhang, Hongwei Liang, Ke Wang, and Jianling Sun. Personalized trip recommendation with poi availability and uncertain traveling time. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 911–920. ACM, 2015.

[124] Chenyi Zhang, Ke Wang, Ee-peng Lim, Qinneng Xu, Jianling Sun, and Hongkun Yu. Are features equally representative? a feature-centric recommendation. In AAAI, 2015.

[125] Ning Zhang, Yuanyuan Tian, and Jignesh M Patel. Discovery-driven graph summarization. In ICDE, 2010.

[126] Wei Zhang and Jianyong Wang. Location and time aware social collaborative retrieval for new successive point-of-interest recommendation. In CIKM, pages 1221–1230. ACM, 2015.

[127] Peixiang Zhao, Xiaolei Li, Dong Xin, and Jiawei Han. Graph cube: on warehousing and olap multidimensional networks. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pages 853–864. ACM, 2011.

[128] Bolong Zheng, Nicholas Jing Yuan, Kai Zheng, Xing Xie, Shazia Sadiq, and Xiaofang Zhou. Approximate keyword search in semantic trajectory database. In 31st International Conference on Data Engineering (ICDE), pages 975–986. IEEE, 2015.

[129] Bolong Zheng, Kai Zheng, Xiaokui Xiao, Han Su, Hongzhi Yin, Xiaofang Zhou, and Guohui Li. Keyword-aware continuous knn query on road networks. In Data Engineering (ICDE), 2016 IEEE 32nd International Conference on, pages 871–882. IEEE, 2016.

[130] Kai Zheng, Shuo Shang, Nicholas Jing Yuan, and Yi Yang. Towards efficient search for activity trajectories. In Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pages 230–241. IEEE, 2013.

[131] Kai Zheng, Yu Zheng, Nicholas Jing Yuan, and Shuo Shang. On discovery of gathering patterns from trajectories. In Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pages 242–253. IEEE, 2013.

[132] Yu Zheng, Lizhu Zhang, Zhengxin Ma, Xing Xie, and Wei-Ying Ma. Recommending friends and locations based on individual location history. ACM Transactions on the Web, 5(1):1–44, 2011.

[133] Yu Zheng, Lizhu Zhang, Xing Xie, and Wei-Ying Ma. Mining interesting locations and travel sequences from gps trajectories. In WWW, pages 791–800, 2009.

Appendix A

List of Publications

Chapter 3

• H. Liang, K. Wang, F. Zhu, “Mining Social Ties Beyond Homophily”, In Proceedings of the IEEE 32nd International Conference on Data Engineering (ICDE), pp. 421-432. IEEE, 2016. [64]

Chapter 4

• C. Zhang, H. Liang, K. Wang, J. Sun, “Personalized Trip Recommendation with POI Availability and Uncertain Traveling Time”, In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (CIKM), pp. 911-920, ACM, 2015. (Runner-up Best Student Paper Award) [123]

• C. Zhang*, H. Liang*, K. Wang, “Trip Recommendation Meets Real World Constraints: POI Availability, Diversity and Traveling Time Uncertainty”, ACM Transactions on Information Systems (TOIS) 35, no. 1 (2016): 5. (*Co-first authorship and equal contribution) [122]

Chapter 5

• H. Liang, K. Wang, “Top-k Route Search through Submodularity Modeling of Recurrent POI Features”, Accepted by the 41st ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2018. [63]

My Contributions

For the ICDE’16 and SIGIR’18 papers listed above, I contributed by coming up with the ideas of the papers, proposing the solutions, conducting the experiments, and writing the complete papers. For the CIKM’15 paper, I proposed the original idea of studying this problem, and the optimal method PDFS was my major contribution to the paper. The first author, Dr. Chenyi Zhang, was a senior Ph.D. student and my lab mate at the time we wrote this paper; he provided many key ideas for the problem, such as modeling the personalized user preferences and the uncertain traveling time, the state expansion optimal method, and the heuristic methods. The TOIS journal paper is an extension of the CIKM’15 paper. The whole extension part (explained in the TOIS paper) was completed by me. The CIKM’15 and TOIS papers are not included in Dr. Chenyi Zhang’s Ph.D. thesis.
