Bipartite Graph Sampling Methods for Sampling Recommendation Data
Zan Huang
September 2009
Zan Huang is Assistant Professor of Supply Chain and Information Systems (email: [email protected]) at
Penn State University, University Park, PA 16802.
Abstract
Sampling is common practice in academic and industry efforts on
recommendation algorithm evaluation and selection. Experimental analysis often
uses a subset of the entire user-item interaction data available in the operational
recommender system, typically derived by including all transactions associated with a
subset of uniformly randomly selected users. Our paper formally studies the
sampling problem for recommendation to understand to what extent algorithm
evaluation results obtained from different types of samples correspond with those
obtained from the population data. We use a bipartite graph to represent the key
input data for recommendation algorithms, the user-item interaction data, and
build on the literature on unipartite graph sampling to develop sampling methods
for our context of bipartite graph sampling. We also develop a series of metrics
for the quality of a given sample, including performance recovery and ranking
recovery measures for assessing both single-sample and multiple-sample recovery
performance. We apply this framework to two real-world recommendation
datasets. Based on the empirical results, we provide some general
recommendations for sampling for recommendation algorithm evaluation.
Keywords: Graph sampling, recommender systems, algorithm evaluation and
selection, bipartite graph
1. Introduction
Recommender systems automate the process of recommending products, services, or information
items (hereafter referred to as items) to users or consumers (hereafter referred to as users) based
on various types of data concerning users, items, and previous interactions between users and
items (Resnick et al. 1997). A wide range of applications can be modeled using the
recommendation framework when one interprets users and items in the broad sense (Breese et
al. 1998; Deshpande et al. 2004; Konstan et al. 1997; McDonald et al. 2000; Pazzani 1999;
Schafer et al. 2001): researchers locating academic papers of potential interest, students looking
for relevant learning materials, buyers searching for products (especially experience goods such
as movies), advertisers identifying potential customers for products and services, Web-based
information systems suggesting links to follow, among others. User-item interactions can also
take different forms such as ratings, paper downloading, product/service purchases, and catalog
and website browsing activities.
Significant academic efforts have been devoted to recommender system-related research
since the concept of collaborative filtering, which makes recommendations only using the user-
item interaction data, was first implemented and popularized around 1995 (Hill et al. 1995;
Resnick et al. 1994; Shardanand et al. 1995). The majority of this body of literature is on
recommendation algorithm design and evaluation. A recent survey paper (Adomavicius et al.
2005), book (Brusilovsky et al. 2007), and special issues (Felfernig et al. 2007; Konstan 2004)
highlight the massive amount of work that has been done and the continued interest in this area.
There is also ever-increasing industry interest and effort in improving the performance of real-
world recommender systems, highlighted by the Netflix Prize launched in October 2006 (Bennett
et al. 2007). A wide range of recommendation algorithms have been developed and numerous
evaluation studies have been conducted to compare the performances of these algorithms on
different recommendation datasets. Many evaluative studies (e.g. (Deshpande et al. 2004;
Herlocker et al. 2004; Huang et al. 2007a)) have collectively revealed that there seems to be no
single best algorithm and that the relative performances of different algorithms are largely
domain- and data-dependent.
For practitioners who need to select the best algorithm for an operational
recommender system, there is a need to assess a wide range of algorithms on their data and
choose the best. This requires numerous runs over the recommendation data for a potentially
very large set of candidate algorithms. For designing new algorithms, numerous runs of the data
on different versions of the new algorithm and on benchmark algorithms are also needed.
As recommender systems are increasingly applied in a wide variety of application
domains and become a key competitive technology for many retailers and service providers,
the scale of these systems increases continuously as new users and items enter the system and
user-item interactions accumulate over time. The sheer size of the recommendation data possessed
by these organizations makes it prohibitively expensive and time-consuming to perform
algorithm evaluation on the entire dataset. As an illustrative example, as of August 2007 Netflix
had collected over 1.9 billion ratings from more than 11.7 million subscribers on over 85
thousand items (Bennett et al. 2007). For another key player in recommendation technology,
Amazon, the recommendation data is probably even larger, with millions or more users and
items in the system. For such large datasets, it is
unacceptably time-consuming to assess the performance of even relatively simple algorithms
such as the commonly used user-based or item-based neighborhood algorithms, not to mention
those more complex model-based algorithms that require computationally intensive model
estimation and prediction.
The common practice is to evaluate algorithms on a sample of the population data. The
100 million ratings Netflix provided for the Netflix Prize Competition is a sample of about 480
thousand randomly-chosen users from the population data. Most research papers that use the
Netflix dataset typically use a subsample of that sample. Other datasets used in the literature are
also samples of population data in almost all cases. Although using the sample data has long
been the standard practice, there is no formal study on sampling of the recommendation data.
Creating a sample by including a random subset of users and all items they have interacted with
(rating, purchase, browsing, etc.) seems to be the most common approach to obtain the sample
data. The understanding of how well the algorithm evaluation results obtained from such a
sample correspond to the results that would be obtained from the population data is lacking.
There are potentially other approaches for obtaining the sample. A straightforward second
approach would be to select a random subset of items and include all users who interact with
them. There is no formal investigation into other sampling methods and the quality of the
samples they produce.
This paper formally studies the sampling problem for recommendation. The most
important type of data involved in the recommendation process is the user-item interaction data,
which can be conveniently represented as a bipartite graph, where users and items are the
two types of nodes and edges between nodes of different types represent user-item
interactions (Huang et al. 2007b). Building on the literature on network/graph sampling, we
investigate a number of sampling methods with user, item or edge orientation. Within our
framework, we also introduce various measures to assess the correspondence of a sample with
the population data with respect to recommendation algorithm performances. In this study, we
focus on collaborative filtering algorithms that use only the user-item interaction data (the bipartite
graph itself, without user/item attributes). We also focus on transaction-based recommendation,
where recommendations are made based on observed transactions (purchases or experiences)
without rating information. We apply this framework to two real-world recommendation
datasets and compare the performances of the sampling methods.
The remainder of the paper is structured as follows. We first provide some background
on recommender systems with details on collaborative filtering algorithms used in this study and
a review of the network/graph sampling literature. We then present the methods for sampling
recommendation data as a bipartite graph, followed by the metrics included in our framework for
assessing the quality of recommendation samples. In the Experimental Study section, we present
results obtained on the two real-world datasets. Lastly, we conclude the paper by summarizing the
main contributions and conclusions and pointing out future directions.
2. Background and Related Work
2.1 Recommender Systems
Depending on the input data used, recommendation algorithms can be roughly categorized into
content-based (using item attributes and interaction data), demographic filtering (using user
attributes and interaction data), collaborative filtering (CF) (using the interaction data only), and
hybrid approaches (using multiple types of input data) (Huang et al. 2004b; Pazzani 1999;
Resnick et al. 1997). The CF approach that only uses the interaction data (Hill et al. 1995;
Resnick et al. 1994; Shardanand et al. 1995) has been acknowledged to be the most commonly
used approach, and the majority of recommendation algorithms proposed in the literature fall
into this category (Adomavicius et al. 2005).
into this category (Adomavicius et al. 2005).
In our study, we focus our attention on sampling recommendation data for CF algorithm
evaluation. In other words, we assume the only data available in the population data is the user-
item interactions. We further limit the scope of this study to transaction-based collaborative
filtering and leave rating-based recommendation for future study. Therefore, the population
recommendation data in our study are user-item interactions that can be viewed as a binary matrix
(elements being 0 or 1) or an undirected, unweighted bipartite graph. Correspondingly, the
sampling methods studied here are also designed based on this population data. To extend the
study to recommendations that use additional types of data (i.e., user/item attributes), the
sampling methods may also utilize additional information from these other types of data.
Nevertheless, the results reported in this study still provide important insights for other types of
recommendation approaches, as the user-item interaction data is the most important input data for
any recommendation approach.
We next provide some details on the CF algorithms used in this study for
recommendation sample evaluation. We first introduce a common notation for describing a CF
problem. The input of the problem is an M × N interaction matrix A = (aij) associated with M
users C = {c1, c2,…, cM} and N items P = {p1, p2, …, pN}. We focus on recommendation that is
based on transactional data. That is, aij can take the value of either 0 or 1 with 1 representing an
observed transaction between ci and pj (for example, ci has purchased pj) and 0 the absence of a
transaction. We consider the output of a collaborative filtering algorithm to be potential scores of
items for individual users that represent possibilities of future transactions. A ranked list of K
items with the highest potential scores for a target user serves as the recommendations.
One basic collaborative filtering algorithm is the well-tested user-based neighborhood
algorithm using statistical correlation (Breese et al. 1998). To predict the potential interests of a
given user, this algorithm first identifies a set of similar users based on correlation coefficients or
similarity measures using the past transactions, and then makes a prediction based on the
behavior of these similar users. The fundamental assumption is that users who have previously
purchased a large set of the same items will continue to buy the same set of new items in the
future. Formally, the algorithm first computes a user similarity matrix WC = (wcst), s, t = 1, 2,
…, M. The similarity score wcst is calculated based on the row vectors of A using a vector
similarity function (such as in (Breese et al. 1998)). A high similarity score wcst indicates that
users s and t may have similar preferences since they have previously purchased a large set of
common items. WC·A gives the potential scores of the items for each user.
The item-based algorithm (Deshpande et al. 2004) is different from the user-based
algorithm only in that item similarities are computed instead of user similarities. The assumption
here is that items that have been bought by the same set of users will continue to be co-purchased
by other users. The user-based and item-based algorithms are the most commonly used CF
algorithms. Formally, this algorithm first computes an item similarity matrix WP = (wpst), s, t = 1,
2, …, N. Here, the similarity score wpst is calculated based on column vectors of A. A high
similarity score wpst indicates that items s and t are similar in the sense that they have been co-
purchased by many users. A·WP gives the potential scores of the items for each user.
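
To make the two neighborhood algorithms concrete, here is a minimal sketch (our own illustration, not code from the paper) that computes cosine-based WC and WP on a binary interaction matrix and forms the potential-score matrices WC·A and A·WP:

```python
import numpy as np

def neighborhood_scores(A):
    """Hedged sketch: user-based and item-based CF on a binary
    M x N interaction matrix A, using cosine similarity."""
    A = np.asarray(A, dtype=float)

    def cosine_sim(X):
        # Cosine similarity between the rows of X.
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        norms[norms == 0] = 1.0           # guard against empty rows
        Xn = X / norms
        S = Xn @ Xn.T
        np.fill_diagonal(S, 0.0)          # exclude self-similarity
        return S

    WC = cosine_sim(A)            # M x M user similarity matrix
    WP = cosine_sim(A.T)          # N x N item similarity matrix
    return WC @ A, A @ WP         # WC.A (user-based), A.WP (item-based)

# The 3-user, 4-item example from Figure 1:
A = np.array([[0, 1, 0, 1],
              [1, 1, 1, 0],
              [1, 0, 1, 0]])
user_scores, item_scores = neighborhood_scores(A)
```

For each user, the K unpurchased items with the highest potential scores would then form the ranked recommendation list.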
The graph-based algorithms, such as the spreading activation algorithm (Huang et al.
2004a), explore longer paths to exploit the transitive user-item associations. The fundamental
assumption is that the behavior of the transitive neighbors (neighbors of the neighbors) is also
informative in predicting the behavior of the users. The spreading activation algorithm starts with
graph-based representation of the interaction matrix. Both the users and items are represented as
nodes, each with an activation level μ_j. To generate recommendations for user c, the
corresponding node is set to have activation level 1 (μ_c = 1). Activation levels of all other nodes
are set at 0. After initialization, the algorithm repeatedly performs the following activation
procedure:

\mu_j(t+1) = f_s\Big( \sum_{i=0}^{n-1} t_{ij}\, \mu_i(t) \Big),

where f_s is a continuous sigmoid transformation function or another normalization function; t_ij
equals α if i and j correspond to an observed transaction and 0 otherwise (0 < α < 1). The
algorithm stops when the activation levels of all nodes converge. The final activation levels μ_j of the
item nodes give the potential scores of all items for user c. In essence, this algorithm achieves an
efficient exploration of the connectedness of a user-item pair within the user-item graph. The
connectedness concept corresponds to the number of paths between the pair and their lengths, and
serves as the predictor of the occurrence of future interactions.
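
A minimal sketch of the spreading activation procedure follows; the weight α = 0.5, the use of tanh as the sigmoid-shaped normalization f_s, the clamping of the source node, and the convergence tolerance are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def spreading_activation(A, user_index, alpha=0.5, tol=1e-6, max_iter=100):
    """Hedged sketch: spreading activation over the user-item graph.

    A: binary M x N interaction matrix; user_index: the target user c.
    Returns the final activation levels of the N item nodes.
    """
    A = np.asarray(A, dtype=float)
    M, N = A.shape
    # Symmetric (M+N) x (M+N) weight matrix with t_ij = alpha on
    # observed user-item transactions and 0 elsewhere.
    T = np.zeros((M + N, M + N))
    T[:M, M:] = alpha * A
    T[M:, :M] = alpha * A.T

    mu = np.zeros(M + N)
    mu[user_index] = 1.0               # activate the target user's node

    for _ in range(max_iter):
        mu_next = np.tanh(T @ mu)      # spread, then squash (tanh as f_s)
        mu_next[user_index] = 1.0      # keep the source clamped (our choice)
        if np.max(np.abs(mu_next - mu)) < tol:   # stop on convergence
            mu = mu_next
            break
        mu = mu_next
    return mu[M:]                      # activation levels of the item nodes
```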
Another commonly studied algorithm is the generative model algorithm (Hofmann 2004;
Ungar et al. 1998). Under this approach, latent class variables are introduced to explain the
patterns of interactions between users and items. One latent class variable can be used to
represent the unknown cause that governs the interactions between users and items. The
interaction matrix A is considered to be generated from the following probabilistic process: (1)
select a user with probability P(c); (2) choose a latent class with probability P(z|c); and (3)
generate an interaction between user c and item p (i.e., setting acp to be 1) with probability P(p|z).
Thus, the probability of observing an interaction between c and p is given by

P(c, p) = P(c) \sum_{z} P(z \mid c)\, P(p \mid z).
Based on the interaction matrix A as the observed data, the relevant probabilities and conditional
probabilities are estimated using a maximum likelihood procedure called Expectation
Maximization (EM). Based on the estimated probabilities, P(c, p) gives the potential score of
item p for user c.
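
As an illustration of the estimation step, the following sketch runs a basic EM loop for the aspect model; the number of latent classes, the iteration count, and the random initialization are illustrative assumptions.

```python
import numpy as np

def aspect_model_em(A, n_classes=4, n_iter=50, seed=0):
    """Hedged sketch: EM for the latent-class (aspect) model.

    A: binary M x N interaction matrix. Returns P(c), P(z|c), P(p|z);
    the potential score is P(c, p) = P(c) * sum_z P(z|c) P(p|z).
    """
    rng = np.random.default_rng(seed)
    M, N = A.shape
    users, items = np.nonzero(A)             # observed (c, p) interactions

    p_z_given_c = rng.random((M, n_classes))
    p_z_given_c /= p_z_given_c.sum(axis=1, keepdims=True)
    p_p_given_z = rng.random((n_classes, N))
    p_p_given_z /= p_p_given_z.sum(axis=1, keepdims=True)
    p_c = np.bincount(users, minlength=M) / len(users)   # P(c) from data

    for _ in range(n_iter):
        # E-step: responsibilities P(z | c, p) for each observed interaction.
        resp = p_z_given_c[users] * p_p_given_z[:, items].T   # (obs, Z)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate the conditional probabilities.
        for z in range(n_classes):
            p_z_given_c[:, z] = np.bincount(users, weights=resp[:, z], minlength=M)
            p_p_given_z[z, :] = np.bincount(items, weights=resp[:, z], minlength=N)
        p_z_given_c /= np.maximum(p_z_given_c.sum(axis=1, keepdims=True), 1e-12)
        p_p_given_z /= np.maximum(p_p_given_z.sum(axis=1, keepdims=True), 1e-12)
    return p_c, p_z_given_c, p_p_given_z

# Potential score of item p for user c:
# score[c, p] = p_c[c] * (p_z_given_c @ p_p_given_z)[c, p]
```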
For recommendation algorithm evaluation, a given recommendation dataset is typically
split into a training set and a testing set. The training set is used as input data to generate
recommendations, which are compared with transactions involving relevant users and items
appearing in the testing set. In this context, relevant users/items refer to those that have appeared in
the training set. For algorithm performance assessment we use the following recommendation
quality metrics from the literature regarding the relevance, coverage, and ranking quality of the
ranked list recommendation (e.g., (Breese et al. 1998)):
precision: P_c = \frac{\text{number of hits}}{K},

recall: R_c = \frac{\text{number of hits}}{\text{number of items user } c \text{ interacted with in the testing set}},

F: F_c = \frac{2 P_c R_c}{P_c + R_c}, and

rank score: RS_c = \sum_j \frac{q_{c,j}}{2^{(j-1)/(h-1)}},

where j is the index in the ranked list; h is the viewing half-life (the rank of the item on the list
such that there is a 50% chance the user will purchase that item); and q_{c,j} = 1 if the item at
rank j is in user c's testing set, and 0 otherwise.
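
A small sketch of these per-user metrics, assuming a ranked list of K recommendations and the set of the user's testing-set items; the half-life default of 5 is a common choice in (Breese et al. 1998):

```python
def user_metrics(ranked_items, test_items, half_life=5):
    """Hedged sketch: precision, recall, F, and rank score for one user.

    ranked_items: list of K recommended item ids, best first.
    test_items:   set of item ids the user interacted with in the testing set.
    """
    K = len(ranked_items)
    hits = sum(1 for item in ranked_items if item in test_items)
    precision = hits / K
    recall = hits / len(test_items) if test_items else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall > 0 else 0.0)
    # Rank score: credit for a hit halves every (half_life - 1) positions.
    rank_score = sum(
        1.0 / 2 ** ((j - 1) / (half_life - 1))
        for j, item in enumerate(ranked_items, start=1)
        if item in test_items)
    return precision, recall, f, rank_score
```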
The precision and recall measures are essentially competing measures. As the number of
recommendations K increases, one expects to see lower precision and higher recall. For this
reason, the precision, recall, and F measures might be sensitive to the number of
recommendations. The rank score measure was proposed in (Breese et al. 1998) and adopted in
many follow-up studies (e.g., (Deshpande et al. 2004; Herlocker et al. 2004; Huang et al. 2004a))
to evaluate the ranking quality of the recommendation list. The number of recommendations has
a similar but minor effect on the rank score, as the rank score of individual recommendations
decreases exponentially with the list index.
We also adopt a Receiver Operating Characteristics (ROC) curve-based measure to
complement precision and recall. Such measures are used in several recommendation evaluation
studies (Herlocker et al. 2004; Papagelis et al. 2005). The ROC curve attempts to measure the
extent to which a learning system can successfully distinguish between signal (relevance) and
noise (non-relevance). For our recommendation task, we define relevance based on user-item
pairs: a recommendation that corresponds to a transaction in the testing set is deemed as relevant
and otherwise as non-relevant. The x-axis and y-axis of the ROC curve are the percent of non-
relevant recommendations and the percent of relevant recommendations, respectively. The entire
set of non-relevant recommendations consists of all possible user-item pairs that appear neither
in the training set (not recommended) nor in the testing set (not relevant). The entire set of
relevant recommendations corresponds to the transactions in the testing set. As we increase the
number of recommendations from 1 to the total number of items, an ROC curve can be plotted
based on the corresponding relevance and non-relevance percentages at each step. The area under
the ROC curve (AUC) gives a complete characterization of the performance of the algorithm
across all possible numbers of recommendations. An ideal recommendation model that ranks all
the future purchases of each user at the top of their individual recommendation lists would produce
a curve that rises steeply in the beginning and flattens out at the end, with AUC close to 1. A random
recommendation model would produce a 45-degree line with AUC of 0.5.
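
One way to compute this AUC without tracing the whole curve is the equivalent Mann-Whitney (rank-sum) statistic over a user's candidate items; the sketch below, with illustrative names, treats testing-set items as relevant and all other non-training items as non-relevant:

```python
def user_auc(scores, candidate_items, test_items):
    """Hedged AUC sketch for one user via the Mann-Whitney statistic.

    scores:          dict item -> potential score from an algorithm.
    candidate_items: items not in the user's training set.
    test_items:      the user's testing-set items (the relevant ones).
    """
    ranked = sorted(candidate_items, key=lambda i: scores.get(i, 0.0),
                    reverse=True)
    n_rel = sum(1 for i in ranked if i in test_items)
    n_irr = len(ranked) - n_rel
    if n_rel == 0 or n_irr == 0:
        return None  # AUC undefined for this user
    # Sum of (1-based) ranks of the relevant items, best rank = 1.
    rank_sum = sum(r for r, i in enumerate(ranked, start=1)
                   if i in test_items)
    # Number of (relevant, non-relevant) pairs ordered correctly.
    correct_pairs = n_rel * n_irr + n_rel * (n_rel + 1) / 2 - rank_sum
    return correct_pairs / (n_rel * n_irr)
```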
For the precision, recall, and F measures, an average value over all users tested was adopted
as the overall metric for the algorithm. For the rank score, an aggregated rank score RS over all
users tested was derived as

RS = 100 \frac{\sum_c RS_c}{\sum_c \max RS_c},

where \max RS_c is the maximum achievable rank score for user c if all of that user's future
purchases had been at the top of the ranked list. The AUC measure was used directly as an overall
measure for all users.
2.2 Graph Sampling
Our population recommendation data can be viewed as an undirected bipartite graph. This graph
is the only information based on which sampling methods can be developed. Our problem of
sampling recommendation data is equivalent to sampling a bipartite graph. There is a recent
growing literature on graph/network sampling, which primarily focuses on unipartite graphs
where there is only one type of nodes and links are possible between any pair of nodes. Most of
these studies are interested in whether or to which extent the graph samples exhibit similar
topological characteristics of the population graph. In this section, we briefly review the
literature in this area.
There are two streams of research under the name of network/graph sampling. The first
stream deals with sampling from a set of graphs as the population (e.g., (Erdos et al. 1959; Erdos
et al. 1960)). The sample in these studies consists of a smaller number of graphs from the
population. The second stream deals with obtaining a subgraph as a sample from a population
graph, which is also referred to more specifically as subgraph sampling. Our study deals with
subgraph sampling from a bipartite graph for recommendation algorithm evaluation purposes.
Subgraph sampling studies date back to the 1960s in mathematics and sociology. Many
sampling methods have been introduced and analyzed since then. Frank (Frank 1978) studied the
random node sampling method, which obtains a random subset of the nodes and includes all
edges among these selected nodes to form the sample graph. Capobianco and Frank (Capobianco
et al. 1982) studied a sampling method commonly used for studying social networks in the sociology
literature, the star or ego-centric sampling method. Under this method, seed nodes are randomly
sampled from the population and then all neighbors of the seeds and the seed-neighbor and
neighbor-neighbor links are included to form the sample graph. Another commonly used
sampling method in sociology studies called snowball or chain-referral sampling was studied in
(Frank 1979; Goodman 1961). Under this approach, a seed node is selected as the starting point
and the population graph is traversed in a breadth-first fashion (including all neighbors of the
seed and then all neighbors of the neighbors of the seed and so forth). Klovdahl (Klovdahl 1977)
studied the method of random walk sampling, which received substantial interest in follow-up
network sampling studies. This method is similar to snowball sampling as it also traverses the
graph to form a sample. It differs from snowball sampling in that only one randomly chosen
neighbor (instead of all neighbors) of the current node is included in the sample. These early
studies have focused on obtaining theoretical and empirical results on the ability of these
sampling methods to recover simple network characteristics such as density.
Several recent studies have investigated subgraph sampling of large-scale graphs, with
the initial focus on sampling the Internet (e.g., (Lakhina et al. 2003)) and WWW (e.g.,
(Rusmevichientong et al. 2001)). These studies are typically domain-specific and do not directly
apply to sampling generic networks. Several recent studies have addressed the general graph
sampling problem on large-scale graphs with a focus on recovering more complex graph
topological characteristics such as degree distribution, path length, and clustering. Many of these
studies focus on a single graph characteristic or focus on just a few sampling methods (Lee et al.
2006; Newman 2003; Stumpf et al. 2005). Several recent studies provide comprehensive
assessment of a wide range of sampling methods for recovering many graph topological
characteristics. Leskovec and Faloutsos (Leskovec et al. 2006) evaluate the samples generated
from a large number of sampling methods on a number of real-world networks. The sampling
methods they studied include random node selection, random edge selection, random walk,
random jump, and snowball methods. They also introduced a new method called the forest fire
method, which falls in between the snowball and random walk methods: a randomly selected
subset of the neighbors of the current node is included to form the sample. The main conclusions from their
study are that simple uniform random node sampling performs surprisingly well and the overall
best performing methods are the ones based on random walk and forest fire.
Most of the studies on graph sampling reviewed so far fall into what we refer to as
edge-burning sampling. Taking random walk as an example, only the edges that were visited
during the graph exploration process are included in the sample graph. A recent study on social
network sampling (Ebbes et al. 2008) investigated node-burning sampling, which includes
all edges among the nodes visited. In this study, we consider node-burning sampling to be more
appropriate, as there is no incentive to drop ratings/interactions for user-item pairs included
in the sample for recommendation algorithm evaluation purposes. The main conclusions from
(Ebbes et al. 2008) are that snowball, random walk, and forest fire are the best performing
sampling methods for social networks with certain degree distributions and density levels.
3. Sampling Recommendation Data
User-item interaction data can be naturally represented as a graph by treating users and items as
nodes, and transactions involving user-item pairs as links between these nodes. This type of
graph is a bipartite graph whose nodes can be divided into two distinct sets and whose links only
make connections between the two sets. The input data for collaborative filtering has been
traditionally represented by an interaction matrix. The example shown in Figure 1, which
involves 3 users, 4 items, and 7 past transactions, shows a direct mapping between the two
representations.
[Figure 1 here: the user-item interaction matrix with rows (0 1 0 1), (1 1 1 0), and (1 0 1 0) for
the three users c1-c3 over the four items p1-p4, shown next to the corresponding bipartite
user-item graph with user nodes, item nodes, and one edge per transaction.]

Figure 1: Example user-item interaction matrix and corresponding bipartite user-item
graph and projected user/item graphs
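
As a concrete illustration (not code from the paper), the sketch below stores the Figure 1 example as a bipartite edge set and derives the projected user graph that the user-oriented exploration methods below operate on; the assignment of matrix rows to users c1-c3 is our reading of the figure:

```python
from itertools import combinations

# Figure 1 example as a bipartite edge set: (user, item) transactions.
edges = {("c1", "p2"), ("c1", "p4"),
         ("c2", "p1"), ("c2", "p2"), ("c2", "p3"),
         ("c3", "p1"), ("c3", "p3")}   # illustrative row-to-user assignment

# Adjacency lists: items per user and users per item.
items_of = {}
users_of = {}
for u, p in edges:
    items_of.setdefault(u, set()).add(p)
    users_of.setdefault(p, set()).add(u)

# Projected user graph: two users are linked if they share >= 1 item.
user_graph = {u: set() for u in items_of}
for u, v in combinations(items_of, 2):
    if items_of[u] & items_of[v]:
        user_graph[u].add(v)
        user_graph[v].add(u)
# Here c2 and c3 are linked through p1 and p3; the item graph is analogous.
```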
Given the population recommendation data, we use a bipartite graph G = (C, P, E) with the
user node set C = {c1, …, cM}, item node set P = {p1, …, pN}, and edge set E = {<ci, pj>} where
ci ∈ C and pj ∈ P. The recommendation sampling problem is to find the sampling method that
produces a subgraph of G providing the best trade-off between the level of sample-population
algorithm evaluation correspondence and sample size.
Sample size in the graph sampling literature typically refers to the number of nodes
included in the sample graph. In our context of sampling for recommendation, we chose to
define sample size as the number of edges (or user-item interactions) in the sample bipartite
graph. Under this definition, sample recommendation datasets of the same size have comparable
file sizes under the relational representation.
The main departure from unipartite graph sampling in our study is that we have two sets
of nodes to sample from, users and items. For the current paper, we study three types of sampling
methods: user-oriented, item-oriented, and edge-oriented. Under user-oriented sampling, we
identify a subset C' of users from the population user set and include all items interacting with
the selected users to form the sample, such that the sample contains λ|E| edges, where λ is the
pre-specified sampling ratio. Under the graph representation, this approach includes C', the links
from nodes in C', and the item nodes reached by these links. Similarly, under item-oriented
sampling, we identify a subset P' of items from the population item set and include all users
interacting with the selected items to form the sample, such that the sample contains λ|E| edges.
Under edge-oriented sampling, we select a subset E' of the edges from the population edge
set with size λ|E| and include all user and item nodes incident on these edges.
In this paper, we study the following sampling methods: random, random walk, snowball,
and forest fire sampling methods under user-oriented, item-oriented, and edge-oriented sampling.
Among these, random sampling has been commonly used as the benchmark sampling method
in the graph sampling literature, and user-oriented random sampling is also the common
method used in the existing practice of recommendation data sampling. Random walk, snowball,
and forest fire sampling methods are variants of graph exploration sampling methods that have
been reported to perform well in various network contexts.
Random sampling
The random sampling approach is the most straightforward sampling method. The subset of
users/items/edges is selected uniformly at random from the population user/item/edge set under
user-oriented/item-oriented/edge-oriented sampling. Since we try to obtain a sample containing a
specific number of edges, under user-oriented (item-oriented) sampling, we include one user
(item) at a time along with all user-item edges involved. When adding a user (item) would result
in more edges than needed, a random subset of the user-item edges involved with that user (item)
is included in the sample. The sample recommendation data used for the Netflix Prize Competition
was obtained from user-oriented random sampling in our terms.
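
A minimal sketch of user-oriented random sampling with an edge budget, following the convention above (the data layout and names are our own):

```python
import random

def user_oriented_random_sample(items_of, sampling_ratio, seed=0):
    """Hedged sketch: user-oriented random sampling with an edge budget.

    items_of: dict user -> set of items that user interacted with.
    Returns a set of sampled (user, item) edges of size ~ratio * |E|.
    """
    rng = random.Random(seed)
    total_edges = sum(len(v) for v in items_of.values())
    budget = int(sampling_ratio * total_edges)

    users = list(items_of)
    rng.shuffle(users)                    # uniform random user order
    sample = set()
    for u in users:
        edges = [(u, p) for p in items_of[u]]
        if len(sample) + len(edges) <= budget:
            sample.update(edges)          # include the user's full edge set
        else:
            # Last user: include a random subset to hit the budget exactly.
            sample.update(rng.sample(edges, budget - len(sample)))
            break
    return sample
```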
Graph exploration sampling
Under methods in this approach, we select the nodes to include in the sample by navigating the
population network following the links from a randomly selected starting seed node, s. Under the
user-oriented (item-oriented) sampling, we operate this graph navigation on the projected user
(item) graphs. A user (item) graph projected from the user-item graph forms an edge between
two users (items) if the two users (items) interact with at least one common item (user). Figure 1
shows the projected user and item graphs for our simple example. Consistent with user-oriented
(item-oriented) random sampling, we include one user (item) at a time along with the user-item
edges involved until reaching the sample size. Under edge-oriented sampling, the graph
navigation is operated on the population user-item graph.
Specifically, graph navigation sampling works as follows. Starting from s, these methods
select either a subset or the entire set of unselected neighbors of s and repeat this process for each
node just included in the sample. If the desired number of edges cannot be reached (i.e., the
exploration process has fallen into a small component of the population graph that is isolated from
the rest), a new random seed node is selected to restart the exploration process.
There are several variations of the graph-exploration methods depending on how we
select the neighbors of a node v that is reached via the exploration process. The snowball method
includes all unselected neighbors of v. The random walk method selects exactly one neighbor
uniformly at random from all the unselected neighbors. These two extremes can also be
understood as the well-studied breadth-first and depth-first searches from the graph search
literature, respectively. Between these two extremes is the forest fire method (Leskovec et al.
2006), which selects x unselected neighbors of v uniformly at random. In our study we set this
number x to be drawn uniformly from {1, 2, 3}. In future study we will investigate additional
settings for x. When implementing these methods, due to the nature of the snowball method, we
can simply forbid revisiting nodes. For random walk and forest fire, nodes need to be
revisited for the exploration to carry on. In order to avoid being stuck in a small component of
the population graph, we set the exploration process to jump to another randomly chosen seed
with a probability of 0.05.
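
The sketch below illustrates edge-oriented exploration in the forest fire style described above; burning all neighbors would give snowball, and burning exactly one would give random walk. The restart handling and edge bookkeeping are our simplifications.

```python
import random

def forest_fire_edge_sample(adj, budget, jump_prob=0.05, seed=0):
    """Hedged sketch: edge-oriented forest fire sampling.

    adj: dict node -> set of neighbors in the population user-item graph.
    budget: number of edges to collect (assumed <= total edge count).
    """
    rng = random.Random(seed)
    nodes = list(adj)
    sample_edges = set()
    frontier = [rng.choice(nodes)]            # random starting seed

    while len(sample_edges) < budget:
        if not frontier or rng.random() < jump_prob:
            frontier = [rng.choice(nodes)]    # jump to a new random seed
        v = frontier.pop(0)
        neighbors = list(adj[v])
        x = rng.randint(1, 3)                 # forest fire: burn 1-3 neighbors
        for w in rng.sample(neighbors, min(x, len(neighbors))):
            edge = (v, w) if v < w else (w, v)   # canonical undirected key
            if edge not in sample_edges:
                sample_edges.add(edge)
                frontier.append(w)            # continue burning from w
                if len(sample_edges) >= budget:
                    break
    return sample_edges
```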
4. Evaluation of Recommendation Samples
In the context of our study, the quality of a sample should reveal the extent to which the
recommendation algorithm evaluation results obtained from sample data align with those
obtained from the population data. Given a recommendation algorithm performance measure of
interest as described earlier in the paper, we consider two types of sample quality measures:
performance recovery measures and ranking recovery measures.
We denote the chosen performance measure as m and the measures obtained using a
particular algorithm j on the sample si and population data as mj(si) and mj(p), respectively. The
performance recovery measure for measure m, algorithm j, and sample i is straightforwardly
defined as:
pr_{m,j,i} = m_j(s_i) - m_j(p).
The ranking recovery measure reflects the alignment between the ranking of the relative
performances of J algorithms considered for the sample and for the population data. We denote
the ranks of algorithm j in terms of measure m for the population p and sample si as rj,m(p) and
rj,m(si), respectively. The ranking recovery measure for measure m and sample i is defined as:

rr_{m,i} = \sum_j \big( r_{j,m}(s_i) - r_{j,m}(p) \big)^2.
Under this definition, we penalize large ranking differences. For example, suppose the ranking
list of four algorithms for the population data is [1,2,3,4] and two samples produce ranking lists of
[2,1,4,3] and [2,3,4,1]; the ranking recovery measures for the two samples will then be 4 and 12,
respectively. We consider the first sample to recover the rankings relatively better. For four
algorithms, the ranking recovery measure ranges from 0 to 20. It may be argued that in the
previous example, the second sample recovers the ranking better, as the ranking among the first
three algorithms is the same as in the population results. We leave the investigation of other
definitions of ranking quality to future work.
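
A brief sketch of both single-sample recovery measures, assuming each algorithm's measure values are held in dicts; the population values below are the retail F values from Table 2, while the sample values are purely illustrative:

```python
def performance_recovery(sample_vals, pop_vals):
    """Hedged sketch: pr = m_j(s_i) - m_j(p), per algorithm j."""
    return {alg: sample_vals[alg] - pop_vals[alg] for alg in pop_vals}

def ranking_recovery(sample_vals, pop_vals):
    """Hedged sketch: rr = sum_j (r_j(s_i) - r_j(p))^2."""
    def ranks(vals):
        ordered = sorted(vals, key=vals.get, reverse=True)  # rank 1 = best
        return {alg: r for r, alg in enumerate(ordered, start=1)}
    rs, rp = ranks(sample_vals), ranks(pop_vals)
    return sum((rs[alg] - rp[alg]) ** 2 for alg in pop_vals)

# Population F values from Table 2 (retail); sample values illustrative.
pop = {"user": 0.0201, "item": 0.0250, "sa": 0.0279, "gen": 0.0156}
smp = {"user": 0.0195, "item": 0.0262, "sa": 0.0241, "gen": 0.0170}
pr = performance_recovery(smp, pop)
rr = ranking_recovery(smp, pop)   # 2: "item" and "sa" swap ranks
```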
For a given sampling ratio and sampling method, the quality of the sample may vary across
different sample instances. In order to estimate the quality of a single sample, we
use multiple samples in the experiment and use the average recovery measures across the
samples to obtain an estimate of the expected recovery measures. For the performance recovery
measure, we define the mean absolute performance recovery measure for method m and algorithm j
as:
mapr_{m,j} = \frac{\sum_i | m_j(s_i) - m_j(p) |}{I},

where I is the number of samples.
Note that the expected recovery measure from multiple samples may not tell the complete
story. This measure could be misleading when certain sampling methods result in a small mean
absolute performance recovery measure but with a very large variance. To address this issue, we
also define a percentile absolute performance recovery measure as

papr_{m,j} = mapr_{m,j} + k \cdot \mathrm{stdev}(apr_{m,j}).

Assuming the absolute performance recovery measure of a sample follows a normal distribution,
Pr(apr_{m,j} ≤ papr_{m,j}) = x%. For example, when k is set to 1, the percentile is 84.1%.
To summarize the recovery of performance for multiple algorithms, we average across the
algorithms:
mapr_m = \frac{\sum_j mapr_{m,j}}{J}, \qquad papr_m = \frac{\sum_j papr_{m,j}}{J}.
Similarly, for ranking recovery, we define the mean ranking recovery measure and the
percentile ranking recovery measure:
mrr_m = \frac{\sum_i rr_{m,i}}{I}, \qquad prr_m = mrr_m + k \cdot \mathrm{stdev}(rr_m).
When multiple samples are available, there is yet another way to recover the population
evaluation results from the results on multiple samples. For the performance recovery measure, we
can average across multiple samples and use the mean performance recovery measure. This
makes intuitive sense, as we expect the mean to have a smaller variance. Note that this measure
represents how a set of samples, rather than a single sample, can recover the population
evaluation results. Specifically, we define the mean performance recovery measure for method m
and algorithm j as:
mpr_{m,j} = \frac{\sum_i \big( m_j(s_i) - m_j(p) \big)}{I}.
We can similarly obtain the ranking recovery measure based on multiple samples. We
may obtain the ranking list based on the mean performance recovery measure across multiple
samples, which we denote as the ranking of mean recovery measure, rmr_m.
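
To tie the multi-sample measures together, a short sketch computing mapr, papr, and mpr for one method and algorithm from a list of per-sample measure values (an illustrative data layout; at least two samples are needed for the standard deviation):

```python
import statistics

def multi_sample_measures(sample_vals, pop_val, k=1.0):
    """Hedged sketch: mapr, papr, and mpr from I samples of one method.

    sample_vals: list of m_j(s_i) over samples i = 1..I (I >= 2).
    pop_val:     m_j(p) on the population data.
    """
    apr = [abs(v - pop_val) for v in sample_vals]       # |m_j(s_i) - m_j(p)|
    mapr = statistics.mean(apr)
    papr = mapr + k * statistics.stdev(apr)             # percentile measure
    mpr = statistics.mean(v - pop_val for v in sample_vals)  # signed mean
    return mapr, papr, mpr
```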
5. Experimental Study
In this study, we follow our recommendation data sampling framework to investigate the
performance of individual sampling methods on two real-world recommendation datasets.
5.1 Datasets
For our experiments, we used samples from two recommendation datasets: a retail dataset from a
major US online apparel merchant and a movie dataset from the MovieLens Project. For the
movie rating dataset we treated a rating on product pj by consumer ci as a transaction (aij = 1) and
ignored the actual rating. Such adaptation has been adopted in several recent studies such as
(Deshpande and Karypis 2004) and is consistent with the “Who Rated What” task in the 2007
ACM KDD Cup competition. Assuming that a user rates a movie based on her experience with
the movie, we predict only whether a consumer will watch a movie in the future and do not
deal with the question of whether or not she will like it.
We included users who had interacted with 5 to 100 items for meaningful testing of the
recommendation algorithms. We set this rather challenging minimum number of transactions at
5 to include data that represent the commonly discussed "sparsity problem" in the recommender
system literature. It is often important for a recommender system to deliver recommendations of
acceptable quality for new users who have only a limited transaction history. We sampled 1,000
users from each of the retail and movie datasets for the experiment. The details of the final datasets we
used as population data are shown in Table 1. The density level of a dataset reported in the table is
defined as the ratio of the number of transactions (links between consumer-product pairs) to the
number of all possible transactions ((# of consumers) × (# of products)). The movie dataset
(1.75%) is much denser than the retail dataset (0.13%).
Table 1: Basic data statistics of the retail and movie datasets

Dataset | # of Consumers | # of Products | # of Transactions | Density Level | Avg. # of purchases per consumer | Avg. sales per product
Retail  | 1,000          | 7,328         | 9,332             | 0.13%         | 9.33                             | 1.27
Movie   | 1,000          | 2,900         | 50,748            | 1.75%         | 50.75                            | 17.5
5.2 Experimental Procedure
For algorithm performance evaluation on the population datasets, we first randomly selected 2/3
of the data into the training set. Within the remaining 1/3 of the transactions, only those
involving users and items that appeared in the training set were included in the testing set. We
applied the user-based and item-based neighborhood algorithms, the spreading activation
algorithm, and the generative model algorithm described earlier, and computed the precision,
recall, F, rank score, and AUC measures. We set the number of recommendations for each user
to 10 in our study. Table 2 shows the population algorithm evaluation results.
Table 2: Population algorithm evaluation results

Dataset | Algorithm            | precision | recall | F      | rank score | AUC
Retail  | user-based           | 0.0124    | 0.0672 | 0.0201 | 4.3791     | 0.5470
Retail  | item-based           | 0.0153    | 0.0853 | 0.0250 | 5.6536     | 0.5800
Retail  | spreading activation | 0.0175    | 0.0932 | 0.0279 | 7.3039     | 0.6047
Retail  | generative           | 0.0102    | 0.0424 | 0.0156 | 1.0784     | 0.6322
Movie   | user-based           | 0.1983    | 0.1304 | 0.1470 | 26.5333    | 0.8768
Movie   | item-based           | 0.2686    | 0.1786 | 0.2005 | 34.0500    | 0.9144
Movie   | spreading activation | 0.1649    | 0.1035 | 0.1192 | 22.8833    | 0.8660
Movie   | generative           | 0.1213    | 0.0768 | 0.0880 | 15.2417    | 0.8435
With a given sampling ratio we obtain 20 samples using random (rand), random walk
(rw), snowball (sb), and forest fire (ff) methods under the user-oriented, item-oriented, and edge-
oriented sampling. With each sample dataset, we follow exactly the same procedure as with the
population data to obtain training and testing sets and obtain algorithm evaluation results with
the four recommendation algorithms. We then compute the sample quality measures introduced
earlier for comparison. In our study we focus on relatively small sampling ratios, as the
performance of sampling methods for small samples is of the most interest. Specifically, we
studied samples of sizes 4%, 6%, 8%, 10%, and 20%.
5.3 Results
Figures 2 and 3 show the sampling results for the retail and movie datasets. We summarize the
main findings from these results below.
Percentile vs. mean recovery measures. Comparing the left panel with the center panel
in Figures 2 and 3, we see that the relative performances of the sampling methods are generally
consistent between the percentile recovery and mean recovery measures. This finding rules out,
for our datasets, the situation where certain sampling methods have a small mean recovery measure
with a large variance or a large mean recovery measure with a small variance.
No single best sampling method. Focusing on the left panels, we see that the
relative performances of individual sampling methods on the two datasets we
studied are far from presenting a clear picture. The best sampling method varies with the dataset,
the performance measure of interest (F or AUC), and whether performance recovery or ranking
recovery is the goal.
We observe quite different patterns of relative performance of the sampling methods for
the retail and movie datasets. For the retail dataset, the best performing sampling method for
performance recovery of the F and AUC measures and for recovery of ranking based on the F
measure is either edge-oriented random walk or edge-oriented forest fire, while the best method
for recovery of ranking based on the AUC measure is user-oriented snowball, with edge-oriented
forest fire or edge-oriented snowball as the second best. There is an interesting reversal for some methods for the
retail dataset. Edge-oriented snowball and user-oriented snowball performed the worst for F
measure performance recovery while edge-oriented random walk was the third worst method
after user-oriented random and edge-oriented random for recovery of ranking based on AUC
measure. It seems that edge-oriented forest fire would be the best choice considering all
four objectives together.
For the movie dataset, we observe that the edge-oriented methods generally performed
worst for all four recovery measures. For performance recovery of the F measure, the user-oriented
and item-oriented sampling methods are all lumped together, while the user-oriented
sampling methods clearly outperformed the item-oriented methods for recovery of ranking based on
the F measure. For performance recovery of the AUC measure, we see the item-oriented methods
outperform the user-oriented methods, with edge-oriented snowball sitting in between. For recovery
of ranking based on the AUC measure, for samples up to 10%, the user-oriented methods are the clear
winner. However, once the sample size reaches 20%, the item-oriented methods delivered much
better results than the user-oriented methods (2.21 from item-oriented random and snowball
compared with 5.61 from user-oriented forest fire).
While further investigation is needed on more datasets, we conjecture that the different
results we see on the retail and movie datasets may be due to the very different sparsity levels of
the two datasets.
Single sample vs. mean from multiple samples. Comparing the center panel and right
panel in Figures 2 and 3, we see that using the mean from multiple samples generally helps to
recover the performance measure, and this is even more the case for recovering the ranking of
algorithms. The improvement in performance recovery using multiple samples is more
prominent with the retail dataset. In particular, for recovery of the F measure, using the mean of 20
samples of size 4% from the best method (edge-oriented forest fire sampling) is comparable with
a single sample of size 20% from the best sampling methods, while using the mean of 20 samples of
size 6% from the best method (user-oriented random sampling) achieved a much better
performance recovery of 0.004 than that of the best single sample of size 20% (0.01). For recovery
of the AUC measure, the improvement is even more prominent. Using the mean of 20 samples of
size 4% from the best method (user-oriented random walk) achieved a much better performance
recovery of 0.017 than that of the best single sample of size 20% from edge-oriented forest fire
(0.03). The improvement for the movie dataset was much smaller, and we do not see that using the
mean from 20 smaller samples would outperform using a single sample of size 20%.
Using mean performance measures from multiple samples to recover the ranking of the
algorithms works even better than using single samples. For the F measure of the retail dataset,
the ranking of mean recovery measure was worse than the mean rank recovery measure for most
sampling methods. But the best performing method, user-oriented random sampling, achieved
a ranking of mean recovery measure of 4 for the sample sizes of 4%, 6%, and 10% and a
ranking of mean recovery measure of 2 for the sample sizes of 8% and 20%. These results are
much better than the best single-sample ranking recovery measures. For the AUC measure of the
retail dataset, the improvement is striking in that more than half of the methods achieved perfect
recovery of the ranking of algorithms for all sample sizes, and among the remaining five methods,
two (item-oriented random sampling and item-oriented random walk) had a worst-case ranking of
mean recovery measure of 2 across the sample sizes. For the F measure of the movie dataset, we
see that the four user-oriented sampling methods delivered the best results for the mean rank recovery
measures, and using the mean of multiple samples helps these methods reach perfect recovery of
the ranking fairly soon. All four methods had a perfect ranking of mean recovery measure for sample
sizes greater than 6%, and user-oriented random walk even had a perfect ranking of mean recovery
measure for the sample size of 4%. For the AUC measure of the movie dataset, it turned out to be
more difficult to recover the ranking. We do not see much improvement in terms of the best
recovery that can be obtained by using the mean from multiple samples in most cases. Item-oriented
forest fire did achieve a perfect ranking of mean recovery measure for the sample size of
20%, while the best mean rank recovery measure was 1.2, from the item-oriented random and
item-oriented snowball 20% samples.
Performance of user-oriented random sampling. User-oriented random sampling is used
in almost all academic studies and practical research on recommendation algorithm evaluation
(including the Netflix Prize dataset). Based on our results, this sampling method is far from the
overall best. Particularly for the retail dataset, it is among the worst two to four methods for the four
recovery measures for individual samples. When using the mean from multiple samples, it had the
best performance for recovery of the F measure and of the ranking based on the F measure, medium
performance for recovery of the AUC measure, and the worst performance for recovery of the ranking
based on the AUC measure. For the movie dataset, user-oriented random sampling is acceptable for
recovery of the F measure and of the ranking based on the F or AUC measure, but not for recovery
of the AUC measure, whether using single samples or the mean from multiple samples.
[Figure 2 here: twelve panels of line charts for the retail dataset, in four rows of three:
(F - percentile absolute performance recovery, F - mean absolute performance recovery,
F - mean performance recovery), (F - percentile rank recovery, F - mean rank recovery,
F - ranking of mean recovery), (AUC - percentile absolute performance recovery, AUC - mean
absolute performance recovery, AUC - mean performance recovery), and (AUC - percentile rank
recovery, AUC - mean rank recovery, AUC - ranking of mean recovery). Each panel plots the
recovery measure against sample sizes of 4%, 6%, 8%, 10%, and 20% for the twelve sampling
methods (user/item/edge-oriented rand, rw, sb, and ff).]

Figure 2: Recovery measures for the retail dataset
[Figure 3 here: the same twelve-panel layout as Figure 2, plotted for the movie dataset.]

Figure 3: Recovery measures for the movie dataset
Practical recommendations. Based on the results from both datasets, it seems that there
is no clear-cut answer as to which sampling method is the best to use. Based on the understanding
that the AUC measure looks at the quality of top-K recommendations for all K values and is more
of a theoretical evaluation metric, we use our results based on the F measure to summarize some
practical recommendations. We also limit our focus to recovery of the ranking of algorithms for
these recommendations, as that is most often the practical objective of conducting an algorithm
evaluation study. The best sampling method to use seems to depend on the density of the
population data. If the data is very sparse, the edge-oriented forest fire or edge-oriented
random walk methods are the best. For denser data, user-oriented sampling methods work best
for recovery of the ranking of algorithms, with the best specific method varying with the sample size.
Using multiple small samples and then taking the average across samples is often much preferred
to using a single larger sample; taking such an approach will often drastically improve the recovery
of the ranking of algorithms. We recommend using the mean from multiple user-oriented random
samples for sparse data and the mean from multiple user-oriented random walk samples for
dense data. These recommendations are based on the two datasets we have, and their general
applicability needs to be further verified with additional empirical evidence.
6. Conclusion and Future Research
In this paper we study sampling of recommendation data for algorithm performance evaluation.
Sampling is common practice in academic and industry efforts on recommendation
algorithm evaluation and selection. Experimental analysis often uses a subset of the entire user-
item interaction data available in the operational recommender system, typically derived by
including all transactions associated with a subset of uniformly randomly selected users. To the
best of our knowledge, no prior formal study has attempted to understand to what extent
algorithm evaluation results obtained from such a sample correspond with those that would have
been obtained from the population data. Our paper introduces this sampling problem for
recommendation. We use a bipartite graph to represent the key input data for recommendation
algorithms, the user-item interaction data, and build on the literature on graph sampling to
develop our analytical framework. We adapted the sampling methods reported to be of good
quality on unipartite graphs in the literature and extended them to our context of bipartite graph
sampling. The sampling methods included in our study are random, random walk, snowball, and
forest fire methods under user-oriented, item-oriented, and edge-oriented sampling. We also
developed a series of metrics for the quality of a given sample with respect to recommendation
algorithm evaluation and selection. These metrics include performance recovery and ranking
recovery measures for assessing both single-sample and multiple-sample recovery performance.
We applied this framework on two real-world recommendation datasets. Based on the empirical
results we provide some general practical recommendations for sampling for recommendation
algorithm evaluation targeted at recovering the ranking of algorithms. Our key findings are that
different sampling methods are recommended for population data with different density levels,
namely edge-oriented forest fire or edge-oriented random walk for sparse data and user-oriented
methods for dense data, and that using the mean from multiple smaller samples often works better
than using a single large sample.
This study represents a first step towards a comprehensive investigation of the
problem of sampling for recommendation. The current work is limited in several aspects, each of
which points to the need for extensive future research. Firstly, we need to further our understanding
of why certain sampling methods excel for certain algorithm evaluation goals. Secondly, we
need to use additional datasets of different characteristics and different scales to evaluate the
generalizability of our current conclusions. Thirdly, we also want to expand our study to include
more recommendation algorithms, and to investigate the cases of rating-based collaborative filtering
recommendation and even hybrid recommendation methods that also utilize user/item attributes.
References
Adomavicius, G., and Tuzhilin, A. "Towards the next generation of recommender systems: A
survey of the state-of-the-art and possible extensions," IEEE Transactions on Knowledge
and Data Engineering (17:6) 2005, pp 734-749.
Bennett, J., and Lanning, S. "The Netflix Prize," KDD Cup and Workshop 2007, San Jose, CA,
2007.
Breese, J.S., Heckerman, D., and Kadie, C. "Empirical analysis of predictive algorithms for
collaborative filtering," Fourteenth Conference on Uncertainty in Artificial Intelligence,
Morgan Kaufmann, Madison, WI, 1998, pp. 43-52.
Brusilovsky, P., Kobsa, A., and Nejdl, W. Adaptive Web: Methods and Strategies of Web
Personalization Springer, Berlin, 2007.
Capobianco, M., and Frank, O. "Comparison of statistical graph-size estimators," Journal of
Statistical Planning and Inference (6:1) 1982, pp 87-97.
Deshpande, M., and Karypis, G. "Item-based top-N recommendation algorithms," ACM
Transactions on Information Systems (22:1) 2004, pp 143-177.
Ebbes, P., Huang, Z., Rangaswamy, A., and Thadakamalla, H.P. "Sampling of large-scale social
networks: Insights from simulated networks.," 18th Annual Workshop on Information
Technologies and Systems, Paris, France, 2008.
Erdos, P., and Renyi, A. "On random graphs," Publicationes Mathematicae (6) 1959, pp 290-
297.
Erdos, P., and Renyi, A. "On the evolution of random graphs," Publications of the Mathematical
Institute of the Hungarian Academy of Sciences (5) 1960, pp 17-61.
Felfernig, A., Friedrich, G., and Schmidt-Thieme, L. "Guest Editors' Introduction: Recommender
Systems," IEEE Intelligent Systems (22:3) 2007, pp 18-21.
Frank, O. "Sampling and estimation in large social networks," Social Networks (1) 1978, pp 91-
101.
Frank, O. "Estimatiing of population totals by use of snowball sampling," in: Perspectives on
Social Network Research, P.W. Holland and S. Leinhardt (eds.), Academic Press, New
York, 1979, pp. 319-347.
Goodman, L.A. "Snowball sampling," Annals of Mathematical Statistics (32) 1961, pp 148-170.
Herlocker, J.L., Konstan, J.A., Terveen, L.G., and Riedl, J.T. "Evaluating collaborative filtering
recommender systems," ACM Transactions on Information Systems (22:1) 2004, pp 5-53.
Hill, W., Stead, L., Rosenstein, M., and Furnas, G. "Recommending and evaluating choices in a
virtual community of use," ACM Conference on Human Factors in Computing Systems,
CHI'95, 1995, pp. 194-201.
Hofmann, T. "Latent semantic models for collaborative filtering," ACM Transactions on
Information Systems (22:1) 2004, pp 89-115.
Huang, Z., Chen, H., and Zeng, D. "Applying associative retrieval techniques to alleviate the
sparsity problem in collaborative filtering," ACM Transactions on Information Systems
(TOIS) (22:1) 2004a, pp 116-142.
Huang, Z., Chung, W., and Chen, H. "A graph model for e-commerce recommender systems,"
Journal of the American Society for Information Science and Technology (JASIST) (55:3)
2004b, pp 259-274.
Huang, Z., Zeng, D., and Chen, H. "A comparative study of recommendation algorithms for e-
commerce applications," IEEE Intelligent Systems (22:5) 2007a, pp 68-78.
Huang, Z., Zeng, D.D., and Chen, H. "Analyzing consumer-product graphs: Empirical findings
and applications in recommender systems," Management Science (53:7) 2007b, pp 1146-
1164.
Klovdahl, A.S. "Social networks in an urban area: First Canberra study," Australian and New
Zealand Journal of Sociology (13) 1977, pp 169-175.
Konstan, J.A. "Introduction to recommender systems: Algorithms and Evaluation," ACM
Transactions on Information Systems (22:1) 2004, pp 1-4.
Konstan, J.A., Miller, B.N., Maltz, D., Herlocker, J.L., Gordon, L.R., and Riedl, J. "GroupLens:
Applying collaborative filtering to Usenet news," Communications of the ACM (40:3)
1997, pp 77-87.
Lakhina, A., Byers, J.W., Crovella, M., and Xie, P. "Sampling biases in IP topology
measurements," 22nd Annual Joint Conference of the IEEE Computer and
Communications Societies, 2003, pp. 332-341.
Lee, S.H., Kim, P.-J., and Jeong, H. "Statistical properties of sampled networks," Physical Review
E (73) 2006, p 016102.
Leskovec, J., and Faloutsos, C. "Sampling from large graphs," 12th ACM SIGKDD international
conference on Knowledge discovery and data mining, Philadelphia, PA, 2006, pp. 631-
636.
McDonald, D., and Ackerman, M. "Expertise recommender: A flexible recommendation system
and architecture," ACM Conference on Computer Supported Cooperative Work,
Philadelphia, 2000, pp. 231-240.
Newman, M.E.J. "Ego-centered networks and the ripple effect," Social Networks (25) 2003, pp
83-95.
Papagelis, M., Plexousakis, D., and Kutsuras, T. "Alleviating the sparsity problem of
collaborative filtering using trust inferences," 3rd International Conference on Trust
Management (iTrust 2005), Paris, France, 2005, pp. 224-239.
Pazzani, M. "A framework for collaborative, content-based and demographic filtering," Artificial
Intelligence Review (13:5) 1999, pp 393-408.
Resnick, P., Iacovou, N., Suchak, M., Bergstorm, P., and Riedl, J. "GroupLens: An open
architecture for collaborative filtering of netnews," ACM Conference on Computer-
Supported Cooperative Work, 1994, pp. 175-186.
Resnick, P., and Varian, H. "Recommender systems," Communications of the ACM (40:3) 1997,
pp 56-58.
Rusmevichientong, P., Pennock, D., Lawrence, S., and Giles, C.L. "Methods for sampling pages
uniformly from the WWW," AAAI Fall Symposium on Using Uncertainty Within
Computation, 2001, pp. 121-128.
Schafer, J., Konstan, J., and Riedl, J. "E-commerce recommendation applications," Data Mining
and Knowledge Discovery (5:1-2) 2001, pp 115-153.
Shardanand, U., and Maes, P. "Social information filtering: Algorithms for automating word of
mouth," ACM Conference on Human Factors in Computing Systems, Denver, CO., 1995,
pp. 210-217.
Stumpf, M.P.H., Wiuf, C., and May, R.M. "Subnets of scale-free networks are not scale free:
Sampling properties of networks," Proceedings of the National Academy of Sciences (102:12)
2005, pp 4221-4224.
Ungar, L.H., and Foster, D.P. "A formal statistical approach to collaborative filtering,"
Conference on Automated Learning and Discovery (CONALD), Pittsburgh, PA, 1998.