Bipartite Graph Sampling Methods for Sampling Recommendation Data
Zan Huang
September 2009
Zan Huang is Assistant Professor of Supply Chain and Information Systems (email: [email protected]) at
Penn State University, University Park, PA 16802.
Abstract
Sampling is common practice in academic and industry efforts on
recommendation algorithm evaluation and selection. Experimental analysis often
uses a subset of the entire user-item interaction data available in the operational
recommender system, typically derived by including all transactions associated with a
subset of uniformly randomly selected users. Our paper formally studies the
sampling problem for recommendation to understand to what extent algorithm
evaluation results obtained from different types of samples correspond with those
obtained from the population data. We use a bipartite graph to represent the key
input data for recommendation algorithms, the user-item interaction data, and
build on the literature on unipartite graph sampling to develop sampling methods
for our context of bipartite graph sampling. We also develop a series of metrics
for the quality of a given sample, including performance recovery and ranking
recovery measures for assessing both single-sample and multiple-sample recovery
performance. We apply this framework to two real-world recommendation
datasets. Based on the empirical results, we provide some general
recommendations for sampling for recommendation algorithm evaluation.
Keywords: Graph sampling, recommender systems, algorithm evaluation and
selection, bipartite graph
1. Introduction
Recommender systems automate the process of recommending products, services, or information
items (hereafter referred to as items) to users or consumers (hereafter referred to as users) based
on various types of data concerning users, items, and previous interactions between users and
items (Resnick et al. 1997). A wide range of applications can be modeled using the
recommendation framework when one interprets users and items in the broad sense (Breese et
al. 1998; Deshpande et al. 2004; Konstan et al. 1997; McDonald et al. 2000; Pazzani 1999;
Schafer et al. 2001): researchers locating academic papers of potential interest, students looking
for relevant learning materials, buyers searching for products (especially experience goods such
as movies), advertisers identifying potential customers for products and services, Web-based
information systems suggesting links to follow, among others. User-item interactions can also
take different forms such as ratings, paper downloading, product/service purchases, and catalog
and website browsing activities.
Significant academic efforts have been devoted to recommender system-related research
since the concept of collaborative filtering, which makes recommendations only using the user-
item interaction data, was first implemented and popularized around 1995 (Hill et al. 1995;
Resnick et al. 1994; Shardanand et al. 1995). The majority of this body of literature is on
recommendation algorithm design and evaluation. A recent survey paper (Adomavicius et al.
2005), book (Brusilovsky et al. 2007), and special issues (Felfernig et al. 2007; Konstan 2004)
highlight the massive amount of work that has been done and the continued interest in this area.
There is also ever-increasing industry interest and effort in improving the performance of real-
world recommender systems, highlighted by the Netflix Prize launched in October 2006 (Bennett
et al. 2007). A wide range of recommendation algorithms have been developed and numerous
evaluation studies have been conducted to compare the performances of these algorithms on
different recommendation datasets. Many evaluative studies (e.g. (Deshpande et al. 2004;
Herlocker et al. 2004; Huang et al. 2007a)) have collectively revealed that there seems to be no
single best algorithm and that the relative performances of different algorithms are largely
domain- and data-dependent.
For practitioners who need to select the best algorithm for an operational
recommender system, there is a need to assess a wide range of algorithms on their data and
choose the best. This requires numerous runs over the recommendation data for a potentially
very large set of candidate algorithms. For designing new algorithms, numerous runs of the data
on different versions of the new algorithm and on benchmark algorithms are also needed.
As recommender systems are increasingly applied in a wide variety of application
domains and become a key competitive technology for many retailers and service providers,
the scale of these systems increases continuously as new users and items enter the system and
user-item interactions accumulate over time. The sheer size of the recommendation data possessed
by these organizations makes it prohibitively expensive and time-consuming to perform
algorithm evaluation on the entire dataset. As an illustrative example, as of August 2007 Netflix
had collected over 1.9 billion ratings from more than 11.7 million subscribers on over 85
thousand items (Bennett et al. 2007). For another key player in recommendation technology,
Amazon, the recommendation data is probably even larger, with millions or more users and
items in the system. For such large datasets, it is
unacceptably time-consuming to assess the performance of even relatively simple algorithms
such as the commonly used user-based or item-based neighborhood algorithms, not to mention
those more complex model-based algorithms that require computationally intensive model
estimation and prediction.
The common practice is to evaluate algorithms on a sample of the population data. The
100 million ratings Netflix provided for the Netflix Prize Competition is a sample of about 480
thousand randomly-chosen users from the population data. Most research papers that use the
Netflix dataset typically use a subsample of that sample. Other datasets used in the literature are
also samples of population data in almost all cases. Although using the sample data has long
been the standard practice, there is no formal study on sampling of the recommendation data.
Creating a sample by including a random subset of users and all items they have interacted with
(rating, purchase, browsing, etc.) seems to be the most common approach to obtain the sample
data. The understanding of how well the algorithm evaluation results obtained from such a
sample correspond to the results that would be obtained from the population data is lacking.
There are potentially other approaches for obtaining the sample. A straightforward second
approach would be to select a random subset of items and include all users who interact with
them. There is no formal investigation into other sampling methods and the quality of the
samples they produce.
This paper formally studies the sampling problem for recommendation. The most
important type of data involved in the recommendation process is the user-item interaction data,
which can be conveniently represented as a bipartite graph, where users and items are the
two types of nodes and edges between nodes of different types represent user-item
interactions (Huang et al. 2007b). Building on the literature on network/graph sampling, we
investigate a number of sampling methods with user, item or edge orientation. Within our
framework, we also introduce various measures to assess the correspondence of a sample with
the population data with respect to recommendation algorithm performances. In this study, we
focus on collaborative filtering algorithms that use only the user-item interaction data (the bipartite
graph itself, without user/item attributes). We also focus on transaction-based recommendation,
where recommendations are made based on observed transactions (purchases or experiences)
without rating information. We apply this framework to two real-world recommendation
datasets and compare the performances of the sampling methods.
The remainder of the paper is structured as follows. We first provide some background
on recommender systems with details on collaborative filtering algorithms used in this study and
a review of the network/graph sampling literature. We then present the methods for sampling
recommendation data as a bipartite graph, followed by the metrics included in our framework for
assessing the quality of recommendation samples. In the Experimental Study section, we present
results obtained on the two real-world datasets. Lastly, we conclude the paper by summarizing the
main contributions and conclusions and pointing out future directions.
2. Background and Related Work
2.1 Recommender Systems
Depending on the input data used, recommendation algorithms can be roughly categorized into
content-based (using item attributes and interaction data), demographic filtering (using user
attributes and interaction data), collaborative filtering (CF) (using the interaction data only), and
hybrid approaches (using multiple types of input data) (Huang et al. 2004b; Pazzani 1999;
Resnick et al. 1997). The CF approach that only uses the interaction data (Hill et al. 1995;
Resnick et al. 1994; Shardanand et al. 1995) has been acknowledged to be the most commonly
used approach, and the majority of recommendation algorithms proposed in the literature fall
into this category (Adomavicius et al. 2005).
into this category (Adomavicius et al. 2005).
In our study, we focus our attention on sampling recommendation data for CF algorithm
evaluation. In other words, we assume the only data available in the population data is the user-
item interactions. We further limit the scope of this study to transaction-based collaborative
filtering and leave rating-based recommendation for future study. Therefore, the population
recommendation data in our study are user-item interactions that can be viewed as a binary matrix
(elements being 0 or 1) or an undirected, unweighted bipartite graph. Correspondingly, the
sampling methods studied here are also designed based on this population data. To extend the
study to recommendations that use additional types of data (i.e., user/item attributes), the
sampling methods may also utilize additional information from these other types of data.
Nevertheless, the results reported in this study still provide important insights for other types of
recommendation approaches, as the user-item interaction data is the most important input data for
any recommendation approach.
We next provide some details on the CF algorithms used in this study for
recommendation sample evaluation. We first introduce a common notation for describing a CF
problem. The input of the problem is an M × N interaction matrix A = (aij) associated with M
users C = {c1, c2,…, cM} and N items P = {p1, p2, …, pN}. We focus on recommendation that is
based on transactional data. That is, aij can take the value of either 0 or 1 with 1 representing an
observed transaction between ci and pj (for example, ci has purchased pj) and 0 the absence of a
transaction. We consider the output of a collaborative filtering algorithm to be potential scores of
items for individual users that represent possibilities of future transactions. A ranked list of K
items with the highest potential scores for a target user serves as the recommendations.
One basic collaborative filtering algorithm is the well-tested user-based neighborhood
algorithm using statistical correlation (Breese et al. 1998). To predict the potential interests of a
given user, this algorithm first identifies a set of similar users based on correlation coefficients or
similarity measures using the past transactions, and then makes a prediction based on the
behavior of these similar users. The fundamental assumption is that users who have previously
purchased a large set of the same items will continue to buy the same set of new items in the
future. Formally, the algorithm first computes a user similarity matrix WC = (wcst), s, t = 1, 2,
…, M. The similarity score wcst is calculated based on the row vectors of A using a vector
similarity function (such as in (Breese et al. 1998)). A high similarity score wcst indicates that
users s and t may have similar preferences since they have previously purchased a large set of
common items. WC·A gives the potential scores of the items for each user.
The item-based algorithm (Deshpande et al. 2004) is different from the user-based
algorithm only in that item similarities are computed instead of user similarities. The assumption
here is that items that have been bought by the same set of users will continue to be co-purchased
by other users. The user-based and item-based algorithms are the most commonly used CF
algorithms. Formally, this algorithm first computes an item similarity matrix WP = (wpst), s, t = 1,
2, …, N. Here, the similarity score wpst is calculated based on column vectors of A. A high
similarity score wpst indicates that items s and t are similar in the sense that they have been co-
purchased by many users. A·WP gives the potential scores of the items for each user.
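
To make the two neighborhood algorithms concrete, here is a minimal sketch (our own illustration, not code from the paper) that computes cosine-based WC and WP on a binary interaction matrix and forms the potential-score matrices WC·A and A·WP:

```python
import numpy as np

def neighborhood_scores(A):
    """Hedged sketch: user-based and item-based CF on a binary
    M x N interaction matrix A, using cosine similarity."""
    A = np.asarray(A, dtype=float)

    def cosine_sim(X):
        # Cosine similarity between the rows of X.
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        norms[norms == 0] = 1.0           # guard against empty rows
        Xn = X / norms
        S = Xn @ Xn.T
        np.fill_diagonal(S, 0.0)          # exclude self-similarity
        return S

    WC = cosine_sim(A)            # M x M user similarity matrix
    WP = cosine_sim(A.T)          # N x N item similarity matrix
    return WC @ A, A @ WP         # WC.A (user-based), A.WP (item-based)

# The 3-user, 4-item example from Figure 1:
A = np.array([[0, 1, 0, 1],
              [1, 1, 1, 0],
              [1, 0, 1, 0]])
user_scores, item_scores = neighborhood_scores(A)
```

For each user, the K unpurchased items with the highest potential scores would then form the ranked recommendation list.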
The graph-based algorithms, such as the spreading activation algorithm (Huang et al.
2004a), explore longer paths to exploit the transitive user-item associations. The fundamental
assumption is that the behavior of the transitive neighbors (neighbors of the neighbors) is also
informative in predicting the behavior of the users. The spreading activation algorithm starts with
graph-based representation of the interaction matrix. Both the users and items are represented as
nodes, each with an activation level μ_j. To generate recommendations for user c, the
corresponding node is set to have activation level 1 (μ_c = 1). Activation levels of all other nodes
are set at 0. After initialization, the algorithm repeatedly performs the following activation
procedure:

\mu_j(t+1) = f_s\Big( \sum_{i=0}^{n-1} t_{ij}\, \mu_i(t) \Big),

where f_s is a continuous sigmoid transformation function or another normalization function; t_ij
equals α if i and j correspond to an observed transaction and 0 otherwise (0 < α < 1). The
algorithm stops when the activation levels of all nodes converge. The final activation levels μ_j of the
item nodes give the potential scores of all items for user c. In essence, this algorithm achieves an
efficient exploration of the connectedness of a user-item pair within the user-item graph. The
connectedness concept corresponds to the number of paths between the pair and their lengths, and
serves as the predictor of the occurrence of future interactions.
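
A minimal sketch of the spreading activation procedure follows; the weight α = 0.5, the use of tanh as the sigmoid-shaped normalization f_s, the clamping of the source node, and the convergence tolerance are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def spreading_activation(A, user_index, alpha=0.5, tol=1e-6, max_iter=100):
    """Hedged sketch: spreading activation over the user-item graph.

    A: binary M x N interaction matrix; user_index: the target user c.
    Returns the final activation levels of the N item nodes.
    """
    A = np.asarray(A, dtype=float)
    M, N = A.shape
    # Symmetric (M+N) x (M+N) weight matrix with t_ij = alpha on
    # observed user-item transactions and 0 elsewhere.
    T = np.zeros((M + N, M + N))
    T[:M, M:] = alpha * A
    T[M:, :M] = alpha * A.T

    mu = np.zeros(M + N)
    mu[user_index] = 1.0               # activate the target user's node

    for _ in range(max_iter):
        mu_next = np.tanh(T @ mu)      # spread, then squash (tanh as f_s)
        mu_next[user_index] = 1.0      # keep the source clamped (our choice)
        if np.max(np.abs(mu_next - mu)) < tol:   # stop on convergence
            mu = mu_next
            break
        mu = mu_next
    return mu[M:]                      # activation levels of the item nodes
```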
Another commonly studied algorithm is the generative model algorithm (Hofmann 2004;
Ungar et al. 1998). Under this approach, latent class variables are introduced to explain the
patterns of interactions between users and items. One latent class variable can be used to
represent the unknown cause that governs the interactions between users and items. The
interaction matrix A is considered to be generated from the following probabilistic process: (1)
select a user with probability P(c); (2) choose a latent class with probability P(z|c); and (3)
generate an interaction between user c and item p (i.e., setting acp to be 1) with probability P(p|z).
Thus, the probability of observing an interaction between c and p is given by

P(c, p) = P(c) \sum_{z} P(z \mid c)\, P(p \mid z).
Based on the interaction matrix A as the observed data, the relevant probabilities and conditional
probabilities are estimated using a maximum likelihood procedure called Expectation
Maximization (EM). Based on the estimated probabilities, P(c, p) gives the potential score of
item p for user c.
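
As an illustration of the estimation step, the following sketch runs a basic EM loop for the aspect model; the number of latent classes, the iteration count, and the random initialization are illustrative assumptions.

```python
import numpy as np

def aspect_model_em(A, n_classes=4, n_iter=50, seed=0):
    """Hedged sketch: EM for the latent-class (aspect) model.

    A: binary M x N interaction matrix. Returns P(c), P(z|c), P(p|z);
    the potential score is P(c, p) = P(c) * sum_z P(z|c) P(p|z).
    """
    rng = np.random.default_rng(seed)
    M, N = A.shape
    users, items = np.nonzero(A)             # observed (c, p) interactions

    p_z_given_c = rng.random((M, n_classes))
    p_z_given_c /= p_z_given_c.sum(axis=1, keepdims=True)
    p_p_given_z = rng.random((n_classes, N))
    p_p_given_z /= p_p_given_z.sum(axis=1, keepdims=True)
    p_c = np.bincount(users, minlength=M) / len(users)   # P(c) from data

    for _ in range(n_iter):
        # E-step: responsibilities P(z | c, p) for each observed interaction.
        resp = p_z_given_c[users] * p_p_given_z[:, items].T   # (obs, Z)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate the conditional probabilities.
        for z in range(n_classes):
            p_z_given_c[:, z] = np.bincount(users, weights=resp[:, z], minlength=M)
            p_p_given_z[z, :] = np.bincount(items, weights=resp[:, z], minlength=N)
        p_z_given_c /= np.maximum(p_z_given_c.sum(axis=1, keepdims=True), 1e-12)
        p_p_given_z /= np.maximum(p_p_given_z.sum(axis=1, keepdims=True), 1e-12)
    return p_c, p_z_given_c, p_p_given_z

# Potential score of item p for user c:
# score[c, p] = p_c[c] * (p_z_given_c @ p_p_given_z)[c, p]
```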
For recommendation algorithm evaluation, a given recommendation dataset is typically
split into a training set and a testing set. The training set is used as input data to generate
recommendations, which are compared with transactions involving relevant users and items
appearing in the testing set. In this context, relevant users/items refer to those that have appeared in
the training set. For algorithm performance assessment we use the following recommendation
quality metrics from the literature regarding the relevance, coverage, and ranking quality of the
ranked list recommendation (e.g., (Breese et al. 1998)):
precision: P_c = \frac{\text{number of hits}}{K},

recall: R_c = \frac{\text{number of hits}}{\text{number of items user } c \text{ interacted with in the testing set}},

F: F_c = \frac{2 P_c R_c}{P_c + R_c}, and

rank score: RS_c = \sum_j \frac{q_{c,j}}{2^{(j-1)/(h-1)}},

where j is the index in the ranked list; h is the viewing half-life (the rank of the item on the list
such that there is a 50% chance the user will purchase that item); and q_{c,j} = 1 if the item at
rank j is in user c's testing set, and 0 otherwise.
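
A small sketch of these per-user metrics, assuming a ranked list of K recommendations and the set of the user's testing-set items; the half-life default of 5 is a common choice in (Breese et al. 1998):

```python
def user_metrics(ranked_items, test_items, half_life=5):
    """Hedged sketch: precision, recall, F, and rank score for one user.

    ranked_items: list of K recommended item ids, best first.
    test_items:   set of item ids the user interacted with in the testing set.
    """
    K = len(ranked_items)
    hits = sum(1 for item in ranked_items if item in test_items)
    precision = hits / K
    recall = hits / len(test_items) if test_items else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall > 0 else 0.0)
    # Rank score: credit for a hit halves every (half_life - 1) positions.
    rank_score = sum(
        1.0 / 2 ** ((j - 1) / (half_life - 1))
        for j, item in enumerate(ranked_items, start=1)
        if item in test_items)
    return precision, recall, f, rank_score
```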
The precision and recall measures are essentially competing measures. As the number of
recommendations K increases, one expects to see lower precision and higher recall. For this
reason, the precision, recall, and F measures might be sensitive to the number of
recommendations. The rank score measure was proposed in (Breese et al. 1998) and adopted in
many follow-up studies (e.g., (Deshpande et al. 2004; Herlocker et al. 2004; Huang et al. 2004a))
to evaluate the ranking quality of the recommendation list. The number of recommendations has
a similar but minor effect on the rank score, as the rank score of individual recommendations
decreases exponentially with the list index.
We also adopt a Receiver Operating Characteristics (ROC) curve-based measure to
complement precision and recall. Such measures are used in several recommendation evaluation
studies (Herlocker et al. 2004; Papagelis et al. 2005). The ROC curve attempts to measure the
extent to which a learning system can successfully distinguish between signal (relevance) and
noise (non-relevance). For our recommendation task, we define relevance based on user-item
pairs: a recommendation that corresponds to a transaction in the testing set is deemed as relevant
and otherwise as non-relevant. The x-axis and y-axis of the ROC curve are the percent of non-
relevant recommendations and the percent of relevant recommendations, respectively. The entire
set of non-relevant recommendations consists of all possible user-item pairs that appear neither
in the training set (not recommended) nor in the testing set (not relevant). The entire set of
relevant recommendations corresponds to the transactions in the testing set. As we increase the
number of recommendations from 1 to the total number of items, an ROC curve can be plotted
based on the corresponding relevance and non-relevance percentages at each step. The area under
the ROC curve (AUC) gives a complete characterization of the performance of the algorithm
across all possible numbers of recommendations. An ideal recommendation model that ranks all
the future purchases of each user at the top of their individual recommendation lists would produce
a curve that rises steeply in the beginning and flattens out at the end, with AUC close to 1. A random
recommendation model would produce a 45-degree line with AUC of 0.5.
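
One way to compute this AUC without tracing the whole curve is the equivalent Mann-Whitney (rank-sum) statistic over a user's candidate items; the sketch below, with illustrative names, treats testing-set items as relevant and all other non-training items as non-relevant:

```python
def user_auc(scores, candidate_items, test_items):
    """Hedged AUC sketch for one user via the Mann-Whitney statistic.

    scores:          dict item -> potential score from an algorithm.
    candidate_items: items not in the user's training set.
    test_items:      the user's testing-set items (the relevant ones).
    """
    ranked = sorted(candidate_items, key=lambda i: scores.get(i, 0.0),
                    reverse=True)
    n_rel = sum(1 for i in ranked if i in test_items)
    n_irr = len(ranked) - n_rel
    if n_rel == 0 or n_irr == 0:
        return None  # AUC undefined for this user
    # Sum of (1-based) ranks of the relevant items, best rank = 1.
    rank_sum = sum(r for r, i in enumerate(ranked, start=1)
                   if i in test_items)
    # Number of (relevant, non-relevant) pairs ordered correctly.
    correct_pairs = n_rel * n_irr + n_rel * (n_rel + 1) / 2 - rank_sum
    return correct_pairs / (n_rel * n_irr)
```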
For the precision, recall, and F measures, an average value over all users tested was adopted
as the overall metric for the algorithm. For the rank score, an aggregated rank score RS over all
users tested was derived as

RS = 100 \frac{\sum_c RS_c}{\sum_c \max RS_c},

where \max RS_c is the maximum achievable rank score for user c if all of that user's future
purchases had been at the top of the ranked list. The AUC measure was used directly as an overall
measure for all users.
2.2 Graph Sampling
Our population recommendation data can be viewed as an undirected bipartite graph. This graph
is the only information based on which sampling methods can be developed. Our problem of
sampling recommendation data is equivalent to sampling a bipartite graph. There is a recent
growing literature on graph/network sampling, which primarily focuses on unipartite graphs
where there is only one type of nodes and links are possible between any pair of nodes. Most of
these studies are interested in whether or to which extent the graph samples exhibit similar
topological characteristics of the population graph. In this section, we briefly review the
literature in this area.
There are two streams of research under the name of network/graph sampling. The first
stream deals with sampling from a set of graphs as the population (e.g., (Erdos et al. 1959; Erdos
et al. 1960)). The sample in these studies consists of a smaller number of graphs from the
population. The second stream deals with obtaining a subgraph as a sample from a population
graph, which is also referred to more specifically as subgraph sampling. Our study deals with
subgraph sampling from a bipartite graph for recommendation algorithm evaluation purposes.
Subgraph sampling studies date back to the 1960s in mathematics and sociology. Many
sampling methods have been introduced and analyzed since then. Frank (Frank 1978) studied the
random node sampling method, which obtains a random subset of the nodes and includes all
edges among these selected nodes to form the sample graph. Capobianco and Frank (Capobianco
et al. 1982) studied a sampling method commonly used for studying social networks in the sociology
literature, the star or ego-centric sampling method. Under this method, seed nodes are randomly
sampled from the population and then all neighbors of the seeds and the seed-neighbor and
neighbor-neighbor links are included to form the sample graph. Another commonly used
sampling method in sociology studies called snowball or chain-referral sampling was studied in
(Frank 1979; Goodman 1961). Under this approach, a seed node is selected as the starting point
and the population graph is traversed in a breadth-first fashion (including all neighbors of the
seed and then all neighbors of the neighbors of the seed and so forth). Klovdahl (Klovdahl 1977)
studied the method of random walk sampling, which received substantial interest in follow-up
network sampling studies. This method is similar to snowball sampling as it also traverses the
graph to form a sample. It differs from snowball sampling in that only one randomly chosen
neighbor (instead of all neighbors) of the current node is included in the sample. These early
studies have focused on obtaining theoretical and empirical results on the ability of these
sampling methods to recover simple network characteristics such as density.
Several recent studies have investigated subgraph sampling of large-scale graphs, with
the initial focus on sampling the Internet (e.g., (Lakhina et al. 2003)) and WWW (e.g.,
(Rusmevichientong et al. 2001)). These studies are typically domain-specific and do not directly
apply to sampling generic networks. Several recent studies have addressed the general graph
sampling problem on large-scale graphs with a focus on recovering more complex graph
topological characteristics such as degree distribution, path length, and clustering. Many of these
studies focus on a single graph characteristic or focus on just a few sampling methods (Lee et al.
2006; Newman 2003; Stumpf et al. 2005). Several recent studies provide comprehensive
assessment of a wide range of sampling methods for recovering many graph topological
characteristics. Leskovec and Faloutsos (Leskovec et al. 2006) evaluate the samples generated
from a large number of sampling methods on a number of real-world networks. The sampling
methods they studied include random node selection, random edge selection, random walk,
random jump, and snowball methods. They also introduced a new method called the forest fire
method, which falls in between the snowball and random walk methods: a randomly selected
subset of the neighbors of the current node is included to form the sample. The main conclusions from their
study are that simple uniform random node sampling performs surprisingly well and the overall
best performing methods are the ones based on random walk and forest fire.
Most of the studies on graph sampling reviewed so far fall into what we refer to as
edge-burning sampling. Taking random walk as an example, only the edges that were visited
during the graph exploration process are included in the sample graph. A recent study on social
network sampling (Ebbes et al. 2008) investigated node-burning sampling, which includes
all edges among the nodes visited. In this study, we consider node-burning sampling to be more
appropriate, as there is no incentive to drop ratings/interactions for user-item pairs included
in the sample for recommendation algorithm evaluation purposes. The main conclusions from
(Ebbes et al. 2008) are that snowball, random walk, and forest fire are the best performing
sampling methods for social networks with certain degree distributions and density levels.
3. Sampling Recommendation Data
User-item interaction data can be naturally represented as a graph by treating users and items as
nodes, and transactions involving user-item pairs as links between these nodes. This type of
graph is a bipartite graph whose nodes can be divided into two distinct sets and whose links only
make connections between the two sets. The input data for collaborative filtering has been
traditionally represented by an interaction matrix. The example shown in Figure 1, which
involves 3 users, 4 items, and 7 past transactions, shows a direct mapping between the two
representations.
[Figure 1 here: the user-item interaction matrix with rows (0 1 0 1), (1 1 1 0), and (1 0 1 0) for
the three users c1-c3 over the four items p1-p4, shown next to the corresponding bipartite
user-item graph with user nodes, item nodes, and one edge per transaction.]

Figure 1: Example user-item interaction matrix and corresponding bipartite user-item
graph and projected user/item graphs
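
As a concrete illustration (not code from the paper), the sketch below stores the Figure 1 example as a bipartite edge set and derives the projected user graph that the user-oriented exploration methods below operate on; the assignment of matrix rows to users c1-c3 is our reading of the figure:

```python
from itertools import combinations

# Figure 1 example as a bipartite edge set: (user, item) transactions.
edges = {("c1", "p2"), ("c1", "p4"),
         ("c2", "p1"), ("c2", "p2"), ("c2", "p3"),
         ("c3", "p1"), ("c3", "p3")}   # illustrative row-to-user assignment

# Adjacency lists: items per user and users per item.
items_of = {}
users_of = {}
for u, p in edges:
    items_of.setdefault(u, set()).add(p)
    users_of.setdefault(p, set()).add(u)

# Projected user graph: two users are linked if they share >= 1 item.
user_graph = {u: set() for u in items_of}
for u, v in combinations(items_of, 2):
    if items_of[u] & items_of[v]:
        user_graph[u].add(v)
        user_graph[v].add(u)
# Here c2 and c3 are linked through p1 and p3; the item graph is analogous.
```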
Given the population recommendation data, we use a bipartite graph G = (C, P, E) with the
user node set C = {c1, …, cM}, item node set P = {p1, …, pN}, and edge set E = {<ci, pj>} where
ci ∈ C and pj ∈ P. The recommendation sampling problem is to find the sampling method that
produces a subgraph of G providing the best trade-off between the level of sample-population
algorithm evaluation correspondence and sample size.
Sample size in the graph sampling literature typically refers to the number of nodes
included in the sample graph. In our context of sampling for recommendation, we chose to
define sample size as the number of edges (or user-item interactions) in the sample bipartite
graph. Under this definition, sample recommendation datasets of the same size have comparable
file sizes under the relational representation.
The main departure from unipartite graph sampling in our study is that we have two sets
of nodes to sample from, users and items. For the current paper, we study three types of sampling
methods: user-oriented, item-oriented, and edge-oriented. Under user-oriented sampling, we
identify a subset C' of users from the population user set and include all items interacting with
the selected users to form the sample, such that the sample contains λ|E| edges, where λ is the
pre-specified sampling ratio. Under the graph representation, this approach includes C', the links
from nodes in C', and the item nodes reached by these links. Similarly, under item-oriented
sampling, we identify a subset P' of items from the population item set and include all users
interacting with the selected items to form the sample, such that the sample contains λ|E| edges.
Under edge-oriented sampling, we select a subset E' of the edges from the population edge
set with size λ|E| and include all user and item nodes incident on these edges.
In this paper, we study the following sampling methods: random, random walk, snowball,
and forest fire sampling methods under user-oriented, item-oriented, and edge-oriented sampling.
Among these, random sampling has been commonly used as the benchmark sampling method
in the graph sampling literature, and user-oriented random sampling is also the common
method used in the existing practice of recommendation data sampling. Random walk, snowball,
and forest fire sampling methods are variants of graph exploration sampling methods that have
been reported to perform well in various network contexts.
Random sampling
The random sampling approach is the most straightforward sampling method. The subset of
users/items/edges is selected uniformly at random from the population user/item/edge set under
user-oriented/item-oriented/edge-oriented sampling. Since we try to obtain a sample containing a
specific number of edges, under user-oriented (item-oriented) sampling, we include one user
(item) at a time along with all user-item edges involved. When adding a user (item) would result
in more edges than needed, a random subset of the user-item edges involved with that user (item)
is included in the sample. The sample recommendation data used for the Netflix Prize Competition
was obtained from user-oriented random sampling in our terms.
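
A minimal sketch of user-oriented random sampling with an edge budget, following the convention above (the data layout and names are our own):

```python
import random

def user_oriented_random_sample(items_of, sampling_ratio, seed=0):
    """Hedged sketch: user-oriented random sampling with an edge budget.

    items_of: dict user -> set of items that user interacted with.
    Returns a set of sampled (user, item) edges of size ~ratio * |E|.
    """
    rng = random.Random(seed)
    total_edges = sum(len(v) for v in items_of.values())
    budget = int(sampling_ratio * total_edges)

    users = list(items_of)
    rng.shuffle(users)                    # uniform random user order
    sample = set()
    for u in users:
        edges = [(u, p) for p in items_of[u]]
        if len(sample) + len(edges) <= budget:
            sample.update(edges)          # include the user's full edge set
        else:
            # Last user: include a random subset to hit the budget exactly.
            sample.update(rng.sample(edges, budget - len(sample)))
            break
    return sample
```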
Graph exploration sampling
Under methods in this approach, we select the nodes to include in the sample by navigating the
population network following the links from a randomly selected starting seed node, s. Under the
user-oriented (item-oriented) sampling, we operate this graph navigation on the projected user
(item) graphs. A user (item) graph projected from the user-item graph forms an edge between
two users (items) if the two users (items) interact with at least one common item (user). Figure 1
shows the projected user and item graphs for our simple example. Consistent with user-oriented
(item-oriented) random sampling, we include one user (item) at a time along with the user-item
edges involved until reaching the sample size. Under edge-oriented sampling, the graph
navigation is operated on the population user-item graph.
Specifically, graph navigation sampling works as follows. Starting from s, these methods
select either a subset or the entire set of unselected neighbors of s and repeat this process for each
node just included in the sample. If the desired number of edges cannot be reached (i.e., the
exploration process has fallen into a small component of the population graph that is isolated from
the rest), a new random seed node is selected to restart the exploration process.
There are several variations of the graph-exploration methods depending on how we
select the neighbors of a node v that is reached via the exploration process. The snowball method
includes all unselected neighbors of v. The random walk method selects exactly one neighbor
uniformly at random from all the unselected neighbors. These two extremes can also be
understood as the well-studied breadth-first and depth-first searches from the graph search
literature, respectively. Between these two extremes is the forest fire method (Leskovec et al.
2006), which selects x unselected neighbors of v uniformly at random. In our study we set this
number x to be drawn uniformly from {1, 2, 3}. In future study we will investigate additional
settings for x. When implementing these methods, due to the nature of the snowball method, we
can simply forbid revisiting nodes. For random walk and forest fire, nodes need to be
revisited for the exploration to carry on. In order to avoid being stuck in a small component of
the population graph, we set the exploration process to jump to another randomly chosen seed
with a probability of 0.05.
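
The sketch below illustrates edge-oriented exploration in the forest fire style described above; burning all neighbors would give snowball, and burning exactly one would give random walk. The restart handling and edge bookkeeping are our simplifications.

```python
import random

def forest_fire_edge_sample(adj, budget, jump_prob=0.05, seed=0):
    """Hedged sketch: edge-oriented forest fire sampling.

    adj: dict node -> set of neighbors in the population user-item graph.
    budget: number of edges to collect (assumed <= total edge count).
    """
    rng = random.Random(seed)
    nodes = list(adj)
    sample_edges = set()
    frontier = [rng.choice(nodes)]            # random starting seed

    while len(sample_edges) < budget:
        if not frontier or rng.random() < jump_prob:
            frontier = [rng.choice(nodes)]    # jump to a new random seed
        v = frontier.pop(0)
        neighbors = list(adj[v])
        x = rng.randint(1, 3)                 # forest fire: burn 1-3 neighbors
        for w in rng.sample(neighbors, min(x, len(neighbors))):
            edge = (v, w) if v < w else (w, v)   # canonical undirected key
            if edge not in sample_edges:
                sample_edges.add(edge)
                frontier.append(w)            # continue burning from w
                if len(sample_edges) >= budget:
                    break
    return sample_edges
```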
4. Evaluation of Recommendation Samples
In the context of our study, the quality of a sample should reveal the extent to which the
recommendation algorithm evaluation results obtained from sample data align with those
obtained from the population data. Given a recommendation algorithm performance measure of
interest as described earlier in the paper, we consider two types of sample quality measures:
performance recovery measures and ranking recovery measures.
We denote the chosen performance measure as m and the measures obtained using a
particular algorithm j on the sample si and population data as mj(si) and mj(p), respectively. The
performance recovery measure for measure m, algorithm j, and sample i is straightforwardly
defined as:
pr_{m,j,i} = m_j(s_i) - m_j(p).
The ranking recovery measure reflects the alignment between the ranking of the relative
performances of J algorithms considered for the sample and for the population data. We denote
the ranks of algorithm j in terms of measure m for the population p and sample si as rj,m(p) and
rj,m(si), respectively. The ranking recovery measure for measure m and sample i is defined as:

rr_{m,i} = \sum_j \big( r_{j,m}(s_i) - r_{j,m}(p) \big)^2.
Under this definition, we penalize large ranking differences. For example, suppose the ranking
list of four algorithms for the population data is [1,2,3,4] and two samples produce ranking lists of
[2,1,4,3] and [2,3,4,1]; the ranking recovery measures for the two samples will then be 4 and 12,
respectively. We consider the first sample to recover the rankings relatively better. For four
algorithms, the ranking recovery measure ranges from 0 to 20. It may be argued that in the
previous example, the second sample recovers the ranking better, as the ranking among the first
three algorithms is the same as in the population results. We leave the investigation of other
definitions of ranking quality to future work.
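
A brief sketch of both single-sample recovery measures, assuming each algorithm's measure values are held in dicts; the population values below are the retail F values from Table 2, while the sample values are purely illustrative:

```python
def performance_recovery(sample_vals, pop_vals):
    """Hedged sketch: pr = m_j(s_i) - m_j(p), per algorithm j."""
    return {alg: sample_vals[alg] - pop_vals[alg] for alg in pop_vals}

def ranking_recovery(sample_vals, pop_vals):
    """Hedged sketch: rr = sum_j (r_j(s_i) - r_j(p))^2."""
    def ranks(vals):
        ordered = sorted(vals, key=vals.get, reverse=True)  # rank 1 = best
        return {alg: r for r, alg in enumerate(ordered, start=1)}
    rs, rp = ranks(sample_vals), ranks(pop_vals)
    return sum((rs[alg] - rp[alg]) ** 2 for alg in pop_vals)

# Population F values from Table 2 (retail); sample values illustrative.
pop = {"user": 0.0201, "item": 0.0250, "sa": 0.0279, "gen": 0.0156}
smp = {"user": 0.0195, "item": 0.0262, "sa": 0.0241, "gen": 0.0170}
pr = performance_recovery(smp, pop)
rr = ranking_recovery(smp, pop)   # 2: "item" and "sa" swap ranks
```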
For a given sampling ratio and sampling method, the quality of the sample may vary across
different sample instances. In order to estimate the quality of a single sample, we
use multiple samples in the experiment and use the average recovery measures across the
samples to obtain an estimate of the expected recovery measures. For the performance recovery
measure, we define the mean absolute performance recovery measure for method m and algorithm j
as:
mapr_{m,j} = \frac{\sum_i | m_j(s_i) - m_j(p) |}{I},

where I is the number of samples.
Note that the expected recovery measure from multiple samples may not tell the complete
story. This measure could be misleading when certain sampling methods result in a small mean
absolute performance recovery measure but with a very large variance. To address this issue, we
also define a percentile absolute performance recovery measure as

papr_{m,j} = mapr_{m,j} + k \cdot \mathrm{stdev}(apr_{m,j}).

Assuming the absolute performance recovery measure of a sample follows a normal distribution,
Pr(apr_{m,j} ≤ papr_{m,j}) = x%. For example, when k is set to 1, the percentile is 84.1%.
To summarize the recovery of performance for multiple algorithms, we average across the
algorithms:
mapr_m = \frac{\sum_j mapr_{m,j}}{J}, \qquad papr_m = \frac{\sum_j papr_{m,j}}{J}.
Similarly, for ranking recovery, we define the mean ranking recovery measure and the
percentile ranking recovery measure:
mrr_m = \frac{\sum_i rr_{m,i}}{I}, \qquad prr_m = mrr_m + k \cdot \mathrm{stdev}(rr_m).
When multiple samples are available, there is yet another way to recover the population
evaluation results from the results on multiple samples. For the performance recovery measure, we
can average across multiple samples and use the mean performance recovery measure. This
makes intuitive sense, as we expect the mean to have a smaller variance. Note that this measure
represents how a set of samples, rather than a single sample, can recover the population
evaluation results. Specifically, we define the mean performance recovery measure for method m
and algorithm j as:
mpr_{m,j} = \frac{\sum_i \big( m_j(s_i) - m_j(p) \big)}{I}.
We can similarly obtain the ranking recovery measure based on multiple samples. We
may obtain the ranking list based on the mean performance recovery measure across multiple
samples, which we denote as the ranking of mean recovery measure, rmr_m.
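
To tie the multi-sample measures together, a short sketch computing mapr, papr, and mpr for one method and algorithm from a list of per-sample measure values (an illustrative data layout; at least two samples are needed for the standard deviation):

```python
import statistics

def multi_sample_measures(sample_vals, pop_val, k=1.0):
    """Hedged sketch: mapr, papr, and mpr from I samples of one method.

    sample_vals: list of m_j(s_i) over samples i = 1..I (I >= 2).
    pop_val:     m_j(p) on the population data.
    """
    apr = [abs(v - pop_val) for v in sample_vals]       # |m_j(s_i) - m_j(p)|
    mapr = statistics.mean(apr)
    papr = mapr + k * statistics.stdev(apr)             # percentile measure
    mpr = statistics.mean(v - pop_val for v in sample_vals)  # signed mean
    return mapr, papr, mpr
```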
5. Experimental Study
In this study, we follow our recommendation data sampling framework to investigate the
performance of individual sampling methods on two real-world recommendation datasets.
5.1 Datasets
For our experiments, we used samples from two recommendation datasets: a retail dataset from a
major US online apparel merchant and a movie dataset from the MovieLens Project. For the
movie rating dataset we treated a rating on product pj by consumer ci as a transaction (aij = 1) and
ignored the actual rating. Such adaptation has been adopted in several recent studies such as
(Deshpande and Karypis 2004) and is consistent with the “Who Rated What” task in the 2007
ACM KDD Cup competition. Assuming that a user rates a movie based on her experience with
the movie, we predict only whether a consumer will watch a movie in the future and do not
deal with the question of whether or not she will like it.
We included users who had interacted with 5 to 100 items for meaningful testing of the
recommendation algorithms. We set this rather challenging minimum number of transactions at
5 to include data that represent the commonly discussed "sparsity problem" in the recommender
system literature. It is often important for a recommender system to deliver recommendations of
acceptable quality for new users who have only a limited transaction history. We sampled 1,000
users from each of the retail and movie datasets for the experiment. The details of the final datasets we
used as population data are shown in Table 1. The density level of a dataset reported in the table is
defined as the ratio of the number of transactions (links between consumer-product pairs) to the
number of all possible transactions ((# of consumers) × (# of products)). The movie dataset
(1.75%) is much denser than the retail dataset (0.13%).
Table 1: Basic data statistics of the retail and movie datasets

Dataset | # of Consumers | # of Products | # of Transactions | Density Level | Avg. # of purchases per consumer | Avg. sales per product
Retail  | 1,000          | 7,328         | 9,332             | 0.13%         | 9.33                             | 1.27
Movie   | 1,000          | 2,900         | 50,748            | 1.75%         | 50.75                            | 17.5
5.2 Experimental Procedure
For algorithm performance evaluation on the population datasets, we first randomly selected 2/3
of the data into the training set. Within the remaining 1/3 of the transactions, only those
involving users and items that appeared in the training set were included in the testing set. We
applied the user-based and item-based neighborhood algorithms, the spreading activation
algorithm, and the generative model algorithm described earlier, and computed the precision,
recall, F, rank score, and AUC measures. We set the number of recommendations for each user
to 10 in our study. Table 2 shows the population algorithm evaluation results.
Table 2: Population algorithm evaluation results

Dataset | Algorithm            | precision | recall | F      | rank score | AUC
Retail  | user-based           | 0.0124    | 0.0672 | 0.0201 | 4.3791     | 0.5470
Retail  | item-based           | 0.0153    | 0.0853 | 0.0250 | 5.6536     | 0.5800
Retail  | spreading activation | 0.0175    | 0.0932 | 0.0279 | 7.3039     | 0.6047
Retail  | generative           | 0.0102    | 0.0424 | 0.0156 | 1.0784     | 0.6322
Movie   | user-based           | 0.1983    | 0.1304 | 0.1470 | 26.5333    | 0.8768
Movie   | item-based           | 0.2686    | 0.1786 | 0.2005 | 34.0500    | 0.9144
Movie   | spreading activation | 0.1649    | 0.1035 | 0.1192 | 22.8833    | 0.8660
Movie   | generative           | 0.1213    | 0.0768 | 0.0880 | 15.2417    | 0.8435
With a given sampling ratio we obtain 20 samples using random (rand), random walk
(rw), snowball (sb), and forest fire (ff) methods under the user-oriented, item-oriented, and edge-
oriented sampling. With each sample dataset, we follow exactly the same procedure as with the
population data to obtain training and testing sets and obtain algorithm evaluation results with
the four recommendation algorithms. We then compute the sample quality measures introduced
earlier for comparison. In our study we focus on relatively small sampling ratios, as the
performance of sampling methods for small samples is of the most interest. Specifically, we
studied samples of sizes 4%, 6%, 8%, 10%, and 20%.
5.3 Results
Figures 2 and 3 show the sampling results for the retail and movie datasets. We summarize the
main findings from these results below.
Percentile vs. mean recovery measures. Comparing the left panel with the center panel
in Figures 2 and 3, we see that the relative performances of the sampling methods are generally
consistent between the percentile recovery and mean recovery measures. This finding rules out,
for our datasets, the situation where certain sampling methods have a small mean recovery measure
with a large variance or a large mean recovery measure with a small variance.
No single best sampling method. Focusing on the left panels, we see that the
relative performances of individual sampling methods on the two datasets we
studied are far from presenting a clear picture. The best sampling method varies with the dataset,
the performance measure of interest (F or AUC), and whether performance recovery or ranking
recovery is the goal.
We observe quite different patterns of relative performance of the sampling methods for
the retail and movie datasets. For the retail dataset, the best performing sampling method for
performance recovery of the F and AUC measures and for recovery of ranking based on the F
measure is either edge-oriented random walk or edge-oriented forest fire, while the best method
for recovery of ranking based on the AUC measure is user-oriented snowball, with edge-oriented
forest fire or edge-oriented snowball as the second best. There is an interesting reversal for some methods for the
retail dataset. Edge-oriented snowball and user-oriented snowball performed the worst for F
measure performance recovery while edge-oriented random walk was the third worst method
after user-oriented random and edge-oriented random for recovery of ranking based on AUC
measure. It seems that edge-oriented forest fire would be the best choice considering all
four objectives together.
For the movie dataset, we observe that the edge-oriented methods generally performed
worst for all four recovery measures. For performance recovery of the F measure, the user-oriented
and item-oriented sampling methods are all lumped together, while the user-oriented
sampling methods clearly outperformed the item-oriented methods for recovery of ranking based on
the F measure. For performance recovery of the AUC measure, we see the item-oriented methods
outperform the user-oriented methods, with edge-oriented snowball sitting in between. For recovery
of ranking based on the AUC measure, for samples up to 10%, the user-oriented methods are the clear
winner. However, once the sample size reaches 20%, the item-oriented methods delivered much
better results than the user-oriented methods (2.21 from item-oriented random and snowball
compared with 5.61 from user-oriented forest fire).
While further investigation is needed on more datasets, we conjecture that the different
results we see on the retail and movie datasets may be due to the very different sparsity levels of
the two datasets.
Single sample vs. mean from multiple samples. Comparing the center panel and right
panel in Figures 2 and 3, we see that using the mean from multiple samples generally helps to
recover the performance measure, and this is even more the case for recovering the ranking of
algorithms. The improvement in performance recovery using multiple samples is more
prominent with the retail dataset. In particular, for recovery of the F measure, using the mean of 20
samples of size 4% from the best method (edge-oriented forest fire sampling) is comparable with
a single sample of size 20% from the best sampling methods, while using the mean of 20 samples of
size 6% from the best method (user-oriented random sampling) achieved a much better
performance recovery of 0.004 than that of the best single sample of size 20% (0.01). For recovery
of the AUC measure, the improvement is even more prominent. Using the mean of 20 samples of
size 4% from the best method (user-oriented random walk) achieved a much better performance
recovery of 0.017 than that of the best single sample of size 20% from edge-oriented forest fire
(0.03). The improvement for the movie dataset was much smaller, and we do not see that using the
mean from 20 smaller samples would outperform using a single sample of size 20%.
Using mean performance measures from multiple samples to recover the ranking of the
algorithms works even better than using single samples. For the F measure of the retail dataset,
the ranking of mean recovery measure was worse than the mean rank recovery measure for most
sampling methods. But the best performing method, user-oriented random sampling, achieved
a ranking of mean recovery measure of 4 for the sample sizes of 4%, 6%, and 10% and a
ranking of mean recovery measure of 2 for the sample sizes of 8% and 20%. These results are
much better than the best single-sample ranking recovery measures. For the AUC measure of the
retail dataset, the improvement is striking in that more than half of the methods achieved perfect
recovery of the ranking of algorithms for all sample sizes, and among the remaining five methods,
two (item-oriented random sampling and item-oriented random walk) had a worst-case ranking of
mean recovery measure of 2 across the sample sizes. For the F measure of the movie dataset, we
see that the four user-oriented sampling methods delivered the best results for the mean rank recovery
measures, and using the mean of multiple samples helps these methods reach perfect recovery of
the ranking fairly soon. All four methods had a perfect ranking of mean recovery measure for sample
sizes greater than 6%, and user-oriented random walk even had a perfect ranking of mean recovery
measure for the sample size of 4%. For the AUC measure of the movie dataset, it turned out to be
more difficult to recover the ranking. We do not see much improvement in terms of the best
recovery that can be obtained by using the mean from multiple samples in most cases. Item-oriented
forest fire did achieve a perfect ranking of mean recovery measure for the sample size of
20%, while the best mean rank recovery measure was 1.2, from the item-oriented random and
item-oriented snowball 20% samples.
Performance of user-oriented random sampling. User-oriented random sampling is used
in almost all academic studies and practical research on recommendation algorithm evaluation
(including the Netflix Prize dataset). Based on our results, this sampling method is far from the
overall best. Particularly for the retail dataset, it is among the worst two to four methods for the four
recovery measures for individual samples. When using the mean from multiple samples, it had the
best performance for recovery of the F measure and of the ranking based on the F measure, medium
performance for recovery of the AUC measure, and the worst performance for recovery of the ranking
based on the AUC measure. For the movie dataset, user-oriented random sampling is acceptable for
recovery of the F measure and of the ranking based on the F or AUC measure, but not for recovery
of the AUC measure, whether using single samples or the mean from multiple samples.
[Figure 2 here: twelve panels of line charts for the retail dataset, in four rows of three:
(F - percentile absolute performance recovery, F - mean absolute performance recovery,
F - mean performance recovery), (F - percentile rank recovery, F - mean rank recovery,
F - ranking of mean recovery), (AUC - percentile absolute performance recovery, AUC - mean
absolute performance recovery, AUC - mean performance recovery), and (AUC - percentile rank
recovery, AUC - mean rank recovery, AUC - ranking of mean recovery). Each panel plots the
recovery measure against sample sizes of 4%, 6%, 8%, 10%, and 20% for the twelve sampling
methods (user/item/edge-oriented rand, rw, sb, and ff).]

Figure 2: Recovery measures for the retail dataset
[Figure 3 here: the same twelve-panel layout as Figure 2, plotted for the movie dataset.]

Figure 3: Recovery measures for the movie dataset
Practical recommendations. Based on the results from both datasets, it seems that there
is no clear-cut answer as to which sampling method is the best to use. Based on the understanding
that the AUC measure looks at the quality of top-K recommendations for all K values and is more
of a theoretical evaluation metric, we use our results based on the F measure to summarize some
practical recommendations. We also limit our focus to recovery of the ranking of algorithms for
these recommendations, as that is most often the practical objective of conducting an algorithm
evaluation study. The best sampling method to use seems to depend on the density of the
population data. If the data is very sparse, the edge-oriented forest fire or edge-oriented
random walk methods are the best. For denser data, user-oriented sampling methods work best
for recovery of the ranking of algorithms, with the best specific method varying with the sample size.
Using multiple small samples and then taking the average across samples is often much preferred
to using a single larger sample; taking such an approach will often drastically improve the recovery
of the ranking of algorithms. We recommend using the mean from multiple user-oriented random
samples for sparse data and the mean from multiple user-oriented random walk samples for
dense data. These recommendations are based on the two datasets we have, and their general
applicability needs to be further verified with additional empirical evidence.
6. Conclusion and Future Research
In this paper we study sampling of recommendation data for algorithm performance evaluation.
Sampling is common practice in academic and industry efforts on recommendation
algorithm evaluation and selection. Experimental analysis often uses a subset of the entire user-
item interaction data available in the operational recommender system, typically derived by
including all transactions associated with a subset of uniformly randomly selected users. To the
best of our knowledge, no prior formal study has attempted to understand to what extent
algorithm evaluation results obtained from such a sample correspond with those that would have
been obtained from the population data. Our paper introduces this sampling problem for
recommendation. We use a bipartite graph to represent the key input data for recommendation
algorithms, the user-item interaction data, and build on the literature on graph sampling to
develop our analytical framework. We adapted the sampling methods reported to be of good
quality on unipartite graphs in the literature and extended them to our context of bipartite graph
sampling. The sampling methods included in our study are random, random walk, snowball, and
forest fire methods under user-oriented, item-oriented, and edge-oriented sampling. We also
developed a series of metrics for the quality of a given sample with respect to recommendation
algorithm evaluation and selection. These metrics include performance recovery and ranking
recovery measures for assessing both single-sample and multiple-sample recovery performance.
We applied this framework on two real-world recommendation datasets. Based on the empirical
results we provide some general practical recommendations for sampling for recommendation
algorithm evaluation targeted at recovering the ranking of algorithms. Our key findings are that
different sampling methods are recommended for population data with different density levels,
namely edge-oriented forest fire or edge-oriented random walk for sparse data and user-oriented
methods for dense data, and that using the mean from multiple smaller samples often works better
than using a single large sample.
This study represents a first step towards a comprehensive investigation of the
problem of sampling for recommendation. The current work is limited in several aspects, each of
which points to the need for extensive future research. Firstly, we need to further our understanding
of why certain sampling methods excel for certain algorithm evaluation goals. Secondly, we
need to use additional datasets of different characteristics and different scales to evaluate the
generalizability of our current conclusions. Thirdly, we also want to expand our study to include
more recommendation algorithms, and to investigate the cases of rating-based collaborative filtering
recommendation and even hybrid recommendation methods that also utilize user/item attributes.
References
Adomavicius, G., and Tuzhilin, A. "Towards the next generation of recommender systems: A
survey of the state-of-the-art and possible extensions," IEEE Transactions on Knowledge
and Data Engineering (17:6) 2005, pp 734-749.
Bennett, J., and Lanning, S. "The Netflix Prize," KDD Cup and Workshop 2007, San Jose, CA,
2007.
Breese, J.S., Heckerman, D., and Kadie, C. "Empirical analysis of predictive algorithms for
collaborative filtering," Fourteenth Conference on Uncertainty in Artificial Intelligence,
Morgan Kaufmann, Madison, WI, 1998, pp. 43-52.
Brusilovsky, P., Kobsa, A., and Nejdl, W. Adaptive Web: Methods and Strategies of Web
Personalization Springer, Berlin, 2007.
Capobianco, M., and Frank, O. "Comparison of statistical graph-size estimators," Journal of
Statistical Planning and Inference (6:1) 1982, pp 87-97.
Deshpande, M., and Karypis, G. "Item-based top-N recommendation algorithms," ACM
Transactions on Information Systems (22:1) 2004, pp 143-177.
Ebbes, P., Huang, Z., Rangaswamy, A., and Thadakamalla, H.P. "Sampling of large-scale social
networks: Insights from simulated networks.," 18th Annual Workshop on Information
Technologies and Systems, Paris, France, 2008.
Erdos, P., and Renyi, A. "On random graphs," Publicationes Mathematicae (6) 1959, pp 290-
297.
Erdos, P., and Renyi, A. "On the evolution of random graphs," Publications of the Mathematical
Institute of the Hungarian Academy of Sciences (5) 1960, pp 17-61.
Felfernig, A., Friedrich, G., and Schmidt-Thieme, L. "Guest Editors' Introduction: Recommender
Systems," IEEE Intelligent Systems (22:3) 2007, pp 18-21.
Frank, O. "Sampling and estimation in large social networks," Social Networks (1) 1978, pp 91-
101.
Frank, O. "Estimatiing of population totals by use of snowball sampling," in: Perspectives on
Social Network Research, P.W. Holland and S. Leinhardt (eds.), Academic Press, New
York, 1979, pp. 319-347.
Goodman, L.A. "Snowball sampling," Annals of Mathematical Statistics (32) 1961, pp 148-170.
Herlocker, J.L., Konstan, J.A., Terveen, L.G., and Riedl, J.T. "Evaluating collaborative filtering
recommender systems," ACM Transactions on Information Systems (22:1) 2004, pp 5-53.
Hill, W., Stead, L., Rosenstein, M., and Furnas, G. "Recommending and evaluating choices in a
virtual community of use," ACM Conference on Human Factors in Computing Systems,
CHI'95, 1995, pp. 194-201.
Hofmann, T. "Latent semantic models for collaborative filtering," ACM Transactions on
Information Systems (22:1) 2004, pp 89-115.
Huang, Z., Chen, H., and Zeng, D. "Applying associative retrieval techniques to alleviate the
sparsity problem in collaborative filtering," ACM Transactions on Information Systems
(TOIS) (22:1) 2004a, pp 116-142.
Huang, Z., Chung, W., and Chen, H. "A graph model for e-commerce recommender systems,"
Journal of the American Society for Information Science and Technology (JASIST) (55:3)
2004b, pp 259-274.
Huang, Z., Zeng, D., and Chen, H. "A comparative study of recommendation algorithms for e-
commerce applications," IEEE Intelligent Systems (22:5) 2007a, pp 68-78.
Huang, Z., Zeng, D.D., and Chen, H. "Analyzing consumer-product graphs: Empirical findings
and applications in recommender systems," Management Science (53:7) 2007b, pp 1146-
1164.
Klovdahl, A.S. "Social networks in an urban area: First Canberra study," Australian and New
Zealand Journal of Sociology (13) 1977, pp 169-175.
Konstan, J.A. "Introduction to recommender systems: Algorithms and Evaluation," ACM
Transactions on Information Systems (22:1) 2004, pp 1-4.
Konstan, J.A., Miller, B.N., Maltz, D., Herlocker, J.L., Gordon, L.R., and Riedl, J. "GroupLens:
Applying collaborative filtering to Usenet news," Communications of the ACM (40:3)
1997, pp 77-87.
Lakhina, A., Byers, J.W., Crovella, M., and Xie, P. "Sampling biases in IP topology
measurements," 22nd Annual Joint Conference of the IEEE Computer and
Communications Societies, 2003, pp. 332-341.
Lee, S.H., Kim, P.-J., and Jeong, H. "Statistical properties of sampled networks," Physical Review
E (73) 2006, p 016102.
Leskovec, J., and Faloutsos, C. "Sampling from large graphs," 12th ACM SIGKDD international
conference on Knowledge discovery and data mining, Philadelphia, PA, 2006, pp. 631-
636.
McDonald, D., and Ackerman, M. "Expertise recommender: A flexible recommendation system
and architecture," ACM Conference on Computer Supported Cooperative Work,
Philadelphia, 2000, pp. 231-240.
Newman, M.E.J. "Ego-centered networks and the ripple effect," Social Networks (25) 2003, pp
83-95.
Papagelis, M., Plexousakis, D., and Kutsuras, T. "Alleviating the sparsity problem of
collaborative filtering using trust inferences," 3rd International Conference on Trust
Management (iTrust 2005), Paris, France, 2005, pp. 224-239.
Pazzani, M. "A framework for collaborative, content-based and demographic filtering," Artificial
Intelligence Review (13:5) 1999, pp 393-408.
Resnick, P., Iacovou, N., Suchak, M., Bergstorm, P., and Riedl, J. "GroupLens: An open
architecture for collaborative filtering of netnews," ACM Conference on Computer-
Supported Cooperative Work, 1994, pp. 175-186.
Resnick, P., and Varian, H. "Recommender systems," Communications of the ACM (40:3) 1997,
pp 56-58.
Rusmevichientong, P., Pennock, D., Lawrence, S., and Giles, C.L. "Methods for sampling pages
uniformly from the WWW," AAAI Fall Symposium on Using Uncertainty Within
Computation, 2001, pp. 121-128.
Schafer, J., Konstan, J., and Riedl, J. "E-commerce recommendation applications," Data Mining
and Knowledge Discovery (5:1-2) 2001, pp 115-153.
Shardanand, U., and Maes, P. "Social information filtering: Algorithms for automating word of
mouth," ACM Conference on Human Factors in Computing Systems, Denver, CO., 1995,
pp. 210-217.
Stumpf, M.P.H., Wiuf, C., and May, R.M. "Subnets of scale-free networks are not scale free:
Sampling properties of networks," Proceedings of the National Academy of Sciences (102:12)
2005, pp 4221-4224.
Ungar, L.H., and Foster, D.P. "A formal statistical approach to collaborative filtering,"
Conference on Automated Learning and Discovery (CONALD), Pittsburgh, PA, 1998.