The Wisdom of Crowds and Its Applications to Business Intelligence
Chien Chin Chen
Department of Information Management
National Taiwan University
Introduction
• With the rapid development of the World Wide Web, people can now obtain and share knowledge easily through a variety of online publishing tools, such as weblogs and online forums.
• The Web has thus become a valuable and abundant information source that has a significant effect on users’ lifestyles, especially their purchasing behavior (Vosen and Schmidt, 2011).
S. Vosen, T. Schmidt, Forecasting private consumption: survey-based indicators vs. Google trends, Journal of Forecasting, (2011) in press.
Introduction (cont.)
• To help users acquire information about desired items efficiently and motivate them to purchase, a great deal of e-commerce research has been devoted to improving recommendation systems.
• Collaborative filtering (CF) is one of the most popular recommendation techniques.
– Like-minded people prefer similar items!
– CF recommends items by aggregating the historical item preferences (i.e., item ratings) of reference users stored in a database.
– Because collaborative filtering is effective, it is utilized by many e-commerce websites, such as Amazon.com.
Introduction (cont.)
• Luarn and Lin (2003) posited that user loyalty is the key to the success of e-services and e-commerce systems.
– In the case of new users, providing them with useful information usually creates a sense of belonging and loyalty, and encourages them to visit e-commerce systems frequently.
• For B2C services, it is therefore essential that new users are provided with appropriate item recommendations.
P. Luarn, H.H. Lin, A Customer Loyalty Model for E-Service Context, Journal of Electronic Commerce Research, 4 (2003) 156-167.
Introduction (cont.)
• However, making recommendations for new users is difficult because of the cold start phenomenon.
– New users require time to become familiar with recommendation systems.
– The systems usually have limited profiles of such users.
– Consequently, it is difficult for CF to identify effective reference users and make useful recommendations to new users.
Introduction (cont.)
• Most e-commerce systems allow users to establish social relationships.
– For example, at Epinions.com, users can create trust and distrust lists.
• Generally, people believe the users on their trust lists (or users trusted by their trusted users).
• In this research, we utilize the web of trust/distrust to identify trustworthy users.
– The identified experts are exploited as reference users to make useful recommendations for cold start users.
The Cold Start Recommendation Method
• The method consists of two stages:
– The model construction stage.
– The recommendation stage.
User Model Construction
• We partition the experienced users into different clusters.
– So that users in the same cluster have similar item preferences.
• Let C = {c1, c2, …, cK} denote a set of user clusters.
– A cluster ck comprises users with similar item ratings.
– We employ the K-means algorithm to cluster experienced users.
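Vector representation of cluster ck:

$$c_k = \frac{1}{|c_k|} \sum_{u_n \in c_k} u_n$$

Similarity between user un and cluster ck:

$$sim(u_n, c_k) = \frac{\sum_{m:\, u_{n,m} \neq 0,\, c_{k,m} \neq 0} (u_{n,m} - \bar{u}_n)(c_{k,m} - \bar{c}_k)}{\sqrt{\sum_{m:\, u_{n,m} \neq 0,\, c_{k,m} \neq 0} (u_{n,m} - \bar{u}_n)^2}\; \sqrt{\sum_{m:\, u_{n,m} \neq 0,\, c_{k,m} \neq 0} (c_{k,m} - \bar{c}_k)^2}}$$

where $\bar{u}_n$ and $\bar{c}_k$ denote the mean ratings of un and ck.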
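As a rough illustration of the two quantities on this slide, the following Python sketch computes a cluster's mean rating vector and a Pearson-style similarity between a user and that vector. It assumes ratings are equal-length lists in which 0 means "not rated" and that means are taken over rated entries only; the function names are hypothetical, and the K-means loop itself is omitted.

```python
from math import sqrt

def centroid(users):
    """Mean rating vector of a cluster: (1/|c_k|) * sum of member vectors.
    Assumption: unrated entries (0) are averaged as zeros."""
    n_items = len(users[0])
    return [sum(u[m] for u in users) / len(users) for m in range(n_items)]

def mean_nonzero(v):
    """Mean over rated (non-zero) entries; 0.0 if nothing is rated."""
    rated = [x for x in v if x != 0]
    return sum(rated) / len(rated) if rated else 0.0

def sim(u, c):
    """Pearson-style similarity over items where both entries are non-zero."""
    mu_u, mu_c = mean_nonzero(u), mean_nonzero(c)
    common = [m for m in range(len(u)) if u[m] != 0 and c[m] != 0]
    num = sum((u[m] - mu_u) * (c[m] - mu_c) for m in common)
    den = (sqrt(sum((u[m] - mu_u) ** 2 for m in common))
           * sqrt(sum((c[m] - mu_c) ** 2 for m in common)))
    return num / den if den else 0.0

c = centroid([[5, 3, 0], [3, 1, 0]])
print(c, sim([5, 3, 0], c))
```

In a K-means setting, each user would be assigned to the cluster with the highest similarity and the centroids recomputed until the assignments stop changing.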
Identify Cluster (Domain) Experts
• We need to measure each user’s trust score in a cluster.
• The web of trust of cluster ck is a directed graph Gk = (ck, ek):
– The users in ck form a set of web nodes.
– ek = {(ui, uj)} is a set of directed edges.
• A directed edge (ui, uj) specifies that user uj is trusted by user ui.
[Figure: Web of Trust. A's trust list is D, G, K, …; the graph shows nodes A, D, G, and K.]
Identify Cluster (Domain) Experts (cont.)
• In a web of trust, users trusted by a large number of trusted users are regarded as cluster experts.
– The concept can be expressed by the following equation:
• It is a recursive function!
– The trust score of a user depends on the scores of the users who trust that user.
• Problems:
– How do we initialize the trust score of a user?
– Will the recursion converge to a unique rk?
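$$r_{k,j} = d \cdot \frac{1}{|c_k|} + (1 - d) \sum_{(u_i, u_j) \in e_k} \frac{r_{k,i}}{\deg(u_i)}$$

where d is a damping factor and deg(ui) is the number of users on ui's trust list.

Identical to the well-known PageRank algorithm (Page et al., 1998).
L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, TR, Stanford, 1998.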
Identify Cluster (Domain) Experts (cont.)
• Erkan and Radev (2004) showed that rk will converge to a unique stationary distribution regardless of the choice of initial vector.
• After rk has converged, rk,j indicates the trust score of user uj in ck.
– The score is a non-negative number.
G. Erkan, D.R. Radev, LexRank: graph-based lexical centrality as salience in text summarization, J. Artif. Int. Res., 22 (2004) 457-479.
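The recursion can be evaluated by power iteration. The sketch below is a minimal illustration, not the authors' implementation: it assumes the update r_j = d/N + (1 - d) · Σ r_i / deg(u_i) over edges (u_i, u_j), a damping value d = 0.15, and that a user with an empty trust list spreads its score evenly over all users. The function name and parameters are hypothetical.

```python
def trust_scores(nodes, edges, d=0.15, init=None, tol=1e-10, max_iter=1000):
    """Power iteration for PageRank-style trust scores.
    edges: (u_i, u_j) pairs meaning u_j is trusted by u_i."""
    n = len(nodes)
    out = {u: [v for (s, v) in edges if s == u] for u in nodes}
    r = dict(init) if init else {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        new = {u: d / n for u in nodes}
        for u in nodes:
            # Assumption: a user who trusts nobody spreads its score evenly.
            targets = out[u] if out[u] else nodes
            share = (1 - d) * r[u] / len(targets)
            for v in targets:
                new[v] += share
        delta = sum(abs(new[u] - r[u]) for u in nodes)
        r = new
        if delta < tol:
            break
    return r

nodes = ["A", "B", "C"]
edges = [("A", "C"), ("B", "C")]  # A and B both trust C
print(trust_scores(nodes, edges))
```

Starting from a different initial vector (the `init` argument) should give the same scores up to the tolerance, which is the convergence property described on this slide.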
Identify Cluster (Domain) Experts (cont.)
• Similarly, we construct the web of distrust of a cluster ck.
– We apply the above algorithm to it to calculate the distrust score xk,j of each user in the cluster.
• Then, we refine the trust vector rk by subtracting xk from it.
– As rk and xk are stationary distributions, the refined trust scores are within the range [-1, 1].
– The values indicate the trust degree of the users in a cluster.
• We rank the users based on their refined trust scores, and select the top-ranked users to compile an expert list expk of cluster ck.
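The refinement and ranking step can be sketched as follows. This is an illustration under assumptions: trust and distrust scores are given as dictionaries keyed by user, a user missing from the distrust dictionary has distrust 0, and the function name is hypothetical.

```python
def select_experts(trust, distrust, top_n):
    """Refine trust scores (r - x) and return the top_n users plus all refined scores."""
    refined = {u: trust[u] - distrust.get(u, 0.0) for u in trust}
    ranked = sorted(refined, key=lambda u: refined[u], reverse=True)
    return ranked[:top_n], refined

experts, refined = select_experts(
    {"A": 0.5, "B": 0.3, "C": 0.2},   # trust scores r_k
    {"A": 0.6, "B": 0.1, "C": 0.3},   # distrust scores x_k
    top_n=1,
)
print(experts, refined)
```

In this toy input, user A has the highest raw trust score but is also heavily distrusted, so B is selected instead.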
Implicit Links Identification
• We observed that many trust lists are very short.
– Users often have privacy concerns.
– Or … users are lazy …
• Even if users do not compile trust lists, their activities reveal their trust orientations.
– We propose an Implicit Link Identification algorithm to identify the missing links in a web of trust.
Implicit Links Identification (cont.)
• Specifically, we construct an implicit link from ui to uj if ui frequently gives a high rating to the items recommended by uj.
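One possible reading of this rule is sketched below. The slide does not say what counts as "frequently" or "high", so the thresholds `high` and `min_count`, the data layout, and the function name are all assumptions made for illustration.

```python
from collections import Counter

def implicit_links(ratings, authors, high=4, min_count=3):
    """Return directed pairs (u_i, u_j) where u_i gave at least min_count
    ratings >= high to items recommended by u_j.
    ratings: (rater, item, score) triples; authors: item -> recommending user.
    Both thresholds are hypothetical."""
    counts = Counter()
    for rater, item, score in ratings:
        author = authors.get(item)
        if author is not None and author != rater and score >= high:
            counts[(rater, author)] += 1
    return {pair for pair, c in counts.items() if c >= min_count}

authors = {"r1": "u2", "r2": "u2", "r3": "u2"}
ratings = [("u1", "r1", 5), ("u1", "r2", 4), ("u1", "r3", 5),
           ("u3", "r1", 5), ("u3", "r2", 2)]
print(implicit_links(ratings, authors))
```

The resulting pairs would be added to the edge set ek before the trust scores are computed.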
Cold Start Recommendation
• We predict the possible rating $\hat{r}_{n,m}$ of an un-rated item im for a cold start user un.
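$$\hat{r}_{n,m} = \bar{u}_n + \alpha \cdot \frac{\sum_{u_j \in exp_{R(i_m)}} r_{R(i_m),j}\,(u_{j,m} - \bar{u}_j)}{\sum_{u_j \in exp_{R(i_m)}} r_{R(i_m),j}} + (1 - \alpha) \cdot \frac{\sum_{1 \le y \le M;\, u_{n,y} \neq 0} \sum_{u_j \in exp_{R(i_y)}} r_{R(i_y),j}\,(u_{j,m} - \bar{u}_j)}{\sum_{1 \le y \le M;\, u_{n,y} \neq 0} \sum_{u_j \in exp_{R(i_y)}} r_{R(i_y),j}}$$

– First term: recommendation made by the experts of im.
– Second term: recommendation made by the experts of the user profile.
– α ∈ [0, 1] weights the two recommendations.

Identify the cluster that is most closely related to item ij:

$$R(i_j) = \arg\max_{1 \le k \le K} \sum_{u_y \in c_k} sign(u_{y,j})$$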
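The cluster-selection step can be illustrated in a few lines of Python. The sketch assumes that sign(u_y,j) is 1 when user u_y has rated item i_j and 0 otherwise, so the selected cluster is the one with the most members who rated the item; the data layout (each user as a dictionary of item ratings) and the function name are hypothetical.

```python
def related_cluster(clusters, item):
    """Index of the cluster with the most users who rated `item`.
    clusters: list of clusters, each a list of {item: rating} dictionaries.
    Ties are resolved in favor of the lower index (an assumption)."""
    counts = [sum(1 for u in c if u.get(item, 0) != 0) for c in clusters]
    return max(range(len(clusters)), key=counts.__getitem__)

clusters = [
    [{"i1": 5}, {"i2": 3}],   # cluster 0
    [{"i1": 4}, {"i1": 2}],   # cluster 1
]
print(related_cluster(clusters, "i1"))
```

The experts of the selected cluster would then supply the ratings that are aggregated into the prediction.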
Experiment
• We used the Epinions dataset to evaluate the performance of the proposed system.

Statistics of the Epinions Dataset

# of experienced users                          49,413
# of cold start users                           71,073
# of reviews                                    1,560,144
# of review ratings                             13,668,319
# of trust relations                            717,667
# of distrust relations                         120,718
Avg. trust list length (experienced users)      12.214
Avg. trust list length (cold start users)       0.679
Experiment (cont.)
• We use the conventional leave-one-out procedure.
– For each cold start user, we evaluate the recommendation performance over multiple runs.
– Each evaluation run treats one rated review as an un-rated item and predicts a rating for it by using the information about the remaining rated reviews.
• The evaluation metrics are the coverage rate, the mean absolute error (MAE), and the execution time.
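The leave-one-out procedure for a single user can be sketched as follows. The sketch assumes that coverage is the fraction of held-out ratings for which the predictor returns a value and that MAE is averaged over those covered ratings only; `predict` stands for any rating predictor and is a hypothetical callback, not part of the original system.

```python
def leave_one_out(user_ratings, predict):
    """Hold out each rated item in turn and predict it from the rest.
    predict(rest, item) returns a rating, or None when no prediction is possible.
    Returns (MAE over covered items, coverage rate)."""
    errors, attempts = [], 0
    for item, actual in user_ratings.items():
        rest = {i: r for i, r in user_ratings.items() if i != item}
        attempts += 1
        pred = predict(rest, item)
        if pred is not None:
            errors.append(abs(pred - actual))
    coverage = len(errors) / attempts if attempts else 0.0
    mae = sum(errors) / len(errors) if errors else None
    return mae, coverage

def mean_predictor(rest, item):
    """Toy predictor: mean of the remaining ratings; cannot cover item 'c'."""
    if item == "c":
        return None
    return sum(rest.values()) / len(rest)

print(leave_one_out({"a": 4, "b": 2, "c": 5}, mean_predictor))
```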
Comparison with Other Recommendation Methods

Method                     MAE    Coverage Rate    Execution Time (ms)
Collaborative Filtering    0.68    3.20%             83.3
MoleTrust-0                0.69    5.78%             13.5
MoleTrustMaven-0           0.69    6.03%             21.9
MoleTrustFreq-0            0.69    6.31%             22.0
MoleTrustConn-0            0.69    6.14%             21.9
TidalTrust                 0.75   34.68%           5873.1
Our method (|expk|=1)      0.71   19.95%             15.4

Method                     MAE    Coverage Rate    Execution Time (ms)
MoleTrust-2                0.75   26.63%           1145.3
MoleTrustMaven-2           0.75   30.90%           2556.3
MoleTrustFreq-2            0.75   31.16%           2202.3
MoleTrustConn-2            0.75   31.23%           2846.2
TidalTrust                 0.75   34.68%           5873.1
Our method (|expk|=4)      0.73   36.35%             16.9
Conclusions
• Research on cold start recommendations is important because retaining new users is the key to the success of e-services and e-commerce systems.
• We have proposed a cold start recommendation method that analyzes the social relationships of e-commerce users in order to identify trustworthy experts and derive useful recommendations for new users.
References
• Chien Chin Chen and Yu-Hao Wan, "An Effective Cold Start Recommendation Method Using Trust and Distrust Networks," under revision at Information Sciences.
• Yu-Hao Wan and Chien Chin Chen, "An Effective Cold Start Recommendation Method Using a Web of Trust," in Proceedings of the 15th Pacific Asia Conference on Information Systems (PACIS 2011), paper 205, 2011.
Introduction
• Web search engines have become important tools that enable people to acquire various types of information.
• Baeza-Yates and Tiberi (2007) asserted that if people’s search activities on the Web are deliberate, then the query logs of search engines are valuable sources of information for analyzing social activities.
– In other words, the query logs are effective sensors of social activities.
Baeza-Yates, R., & Tiberi, A., 2007. Extracting semantic relations from query logs, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. San Jose, California, USA, pp. 76-85.
Introduction (cont.)
• In the past, query logs were the private property of search engine companies …
• Now, it is possible to obtain the latest query information through a number of online Web services, for example, Google Insights.
– http://www.google.com/insights/search
• Broadly speaking, query logs can be considered a kind of Web 2.0 data.
– Internet users can access query information freely, and they can contribute their opinions when performing Web searches.
Introduction (cont.)
• Ginsberg et al. (2009) employed query logs for efficient influenza surveillance. The proposed linear model outperforms other methods used to predict influenza outbreaks.
• Since then, many studies have used query logs to predict unemployment rates, consumption, retail sales, vehicle sales, real estate sales, and travel package sales.
J. Ginsberg, M.H. Mohebbi, R.S. Patel, L. Brammer, M.S. Smolinski, L. Brilliant, Detecting influenza epidemics using search engine query data, Nature 457 (2009) 1012–1014.
Introduction (cont.)
• However … most approaches rely on experts to compile a set of query terms.
• Can we identify representative query terms automatically?
• In this work, we utilize the query logs of search engines for business cycle modeling.
– Instead of consulting domain experts, we propose a feature (term) selection method to identify representative query terms from a large set of business articles.
Monitor Indicator
• The Council for Economic Planning and Development (CEPD) maintains an indicator, called the Monitor Indicator, which indicates the monthly status of Taiwan’s business cycle.
– Five status classes: red light (紅燈), yellow-red light (黃紅燈), green light (綠燈), yellow-blue light (黃藍燈), and blue light (藍燈).
– There is a one-month lag in the release of all the business indices and indicators compiled by the CEPD.
– The time lag makes it harder for governments and industries to establish appropriate business policies.
• We define business cycle surveillance as a supervised classification problem.
– That is, given the query frequencies of representative query terms, we need to predict the status class of the business cycle.
Feature Selection
• For each extracted noun wk, we treat its query frequencies Fk = {fk,1, fk,2, …, fk,n} as a time series.
– fk,j is the query frequency of wk downloaded from Google Insights at time (month) j.
– It is an integer in the range [0, 100].
• Let M = {m1, m2, …, mn} be a set of Monitor Indicator values published by the CEPD.
– mj is the Monitor Indicator value at time (month) j.
– Its range is [9, 45].
Feature Selection (cont.)
• For each wk, we treat the pair <wk,xk> as a surveillance feature.
– The variable xk is a non-negative integer that determines the length of the lag in wk’s query frequencies.
• We calculate the representativeness of a feature <wk,xk> by the following equation:
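$$rep(\langle w_k, x_k \rangle) = \left| corr(F_k[1, n - x_k],\, M[x_k + 1, n]) \right| = \left| \frac{\sum_{j = x_k + 1}^{n} (f_{k,j - x_k} - \bar{F}_k[1, n - x_k])(m_j - \bar{M}[x_k + 1, n])}{\sqrt{\sum_{j = x_k + 1}^{n} (f_{k,j - x_k} - \bar{F}_k[1, n - x_k])^2}\; \sqrt{\sum_{j = x_k + 1}^{n} (m_j - \bar{M}[x_k + 1, n])^2}} \right|$$

where $\bar{F}_k[1, n - x_k]$ and $\bar{M}[x_k + 1, n]$ are the means of the corresponding sub-series.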
Feature Selection (cont.)
• If corr(Fk[1,n–xk],M[xk+1,n]) > 0,
– The surveillance feature <wk,xk> and the business cycle are positively correlated.
– Users tend to query wk xk months before the expansion of a business cycle.
• What about corr(Fk[1,n–xk],M[xk+1,n]) < 0?
• Both positively and negatively correlated features are useful for constructing a business status classifier, so rep(<wk,xk>) outputs the absolute value of the correlation coefficient.
• We select the top K representative terms to construct a business status classifier for comparison.
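The lagged-correlation score can be sketched in plain Python as below. It assumes each query-frequency series has the same length n as the Monitor Indicator series, treats a constant series as having correlation 0, and uses hypothetical function names; the toy numbers are invented for illustration only.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = sqrt(sum((x - ma) ** 2 for x in a)) * sqrt(sum((y - mb) ** 2 for y in b))
    return num / den if den else 0.0

def rep(freqs, monitor, lag):
    """|corr(F_k[1, n - x_k], M[x_k + 1, n])| for lag x_k."""
    n = len(monitor)
    return abs(pearson(freqs[: n - lag], monitor[lag:]))

def top_features(term_freqs, monitor, lags=(0, 1, 2), k=10):
    """Rank (term, lag, score) triples by representativeness; keep the top k."""
    scored = [(w, x, rep(f, monitor, x)) for w, f in term_freqs.items() for x in lags]
    return sorted(scored, key=lambda t: t[2], reverse=True)[:k]

monitor = [10, 20, 30, 40, 50]                   # toy Monitor Indicator values
terms = {"a": [5, 4, 3, 2, 1], "b": [3, 3, 3, 3, 3]}
print(top_features(terms, monitor, lags=(0,), k=1))
```

Because the score is an absolute value, a term whose query volume falls as the indicator rises ranks as highly as one that rises with it.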
Classification Model
• We adopt the Naive Bayes model to construct our business cycle surveillance system.
• Naive Bayes predicts the most likely business cycle status as follows:
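$$\hat{cs}_t = \arg\max_{1 \le l \le L} P(c_l \mid O_t) = \arg\max_{1 \le l \le L} P(c_l)\, P(O_t \mid c_l) = \arg\max_{1 \le l \le L} P(c_l) \prod_{k=1}^{K} P(f_{k,t - x_k} \mid c_l)$$

– $O_t = \langle f_{1,t - x_1}, f_{2,t - x_2}, \ldots, f_{K,t - x_K} \rangle$ is the frequency observation vector.
– $P(c_l)$ and $P(f_{k,t - x_k} \mid c_l)$ can be acquired from a training dataset.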
Data Sparseness Problem
• Unfortunately, the Naive Bayes classification is affected by the sparseness of query frequencies.
• We employ two popular data discretization methods, namely, the equal-width method (Han and Kamber, 2006) and the ChiMerge method (Kerber, 1992).
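To make the interaction between discretization and Naive Bayes concrete, here is a small sketch that bins frequencies with the equal-width method and then applies a categorical Naive Bayes classifier. It is an illustration only: the number of bins, the add-one smoothing term `alpha`, the function names, and the toy data are assumptions, and ChiMerge is not shown.

```python
from collections import Counter, defaultdict
from math import log

def equal_width_bin(value, lo=0, hi=100, bins=5):
    """Map a frequency in [lo, hi] to one of `bins` equal-width intervals."""
    width = (hi - lo) / bins
    return min(int((value - lo) / width), bins - 1)

def train(X, y, bins=5):
    """Count class priors and per-feature bin frequencies."""
    prior = Counter(y)
    cond = defaultdict(Counter)  # (feature index, class) -> bin counts
    for row, label in zip(X, y):
        for k, v in enumerate(row):
            cond[(k, label)][equal_width_bin(v, bins=bins)] += 1
    return prior, cond

def predict(row, prior, cond, bins=5, alpha=1.0):
    """argmax_l P(c_l) * prod_k P(bin(f_k) | c_l), computed in log space.
    alpha is an assumed smoothing constant, not taken from the slides."""
    total = sum(prior.values())
    best, best_score = None, float("-inf")
    for label, count in prior.items():
        score = log(count / total)
        for k, v in enumerate(row):
            b = equal_width_bin(v, bins=bins)
            score += log((cond[(k, label)][b] + alpha) / (count + alpha * bins))
        if score > best_score:
            best, best_score = label, score
    return best

X = [[90, 10], [85, 20], [10, 90], [15, 80]]     # toy query frequencies
y = ["green", "green", "blue", "blue"]           # toy status classes
prior, cond = train(X, y)
print(predict([88, 15], prior, cond))
```

Without binning, a frequency value that never occurred with a class in training would have an estimated probability of zero; grouping values into a few intervals makes unseen exact values far less likely to cause this.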
Experiment
• We collected the CEPD’s 60 monthly Monitor Indicator statuses for a 5-year period (2005 to 2009) and used them to construct and test the system.
• We downloaded 11,228 business-related articles published between 2005 and 2009 by the Liberty Times.
• To derive credible evaluation results, we utilize the leave-one-out cross-validation method.

Number of evaluated business statuses    60
Number of news articles                  11,228
Number of extracted terms                58,689
Number of candidate terms                6,224
Number of query frequencies              3,521,340
Experiment (cont.)
The effect of K, the number of selected terms:
– lag-012: the setting includes the frequency lags for a three-month period (i.e., a quarter).
• The precision rate decreases as the number of selected features (i.e., K) increases.
• lag-012 is superior to lag-0 when K > 10.
• Many of the selected features are based on the same term and differ only in the length of the lag.
Experiment (cont.)
The effect of discretization:
• Frequency discretization is very important for constructing a reliable business status classifier.
• It avoids the zero-probability problem.
Experiment (cont.)
Comparisons with other Feature Selection Methods:
• The terms selected by Freq may be too common to be used in discriminating the status of business cycles.
• The terms selected by PMI-IR are strongly associated with business cycles.
• However, such terms are very specific, and they are rarely used to query business cycle information.
Query Term Analysis

Features Selected under lag-0

wk                                             xk   rep     pos./neg.
獎助金 (grant)                                  0    0.693   negative
買點 (buy in)                                   0    0.666   negative
大華 (Grand Cathay Securities Corporation)      0    0.644   positive
現增 (capital increased by cash)                0    0.630   positive
護盤 (stabilization)                            0    0.619   negative
詳情 (inside information)                       0    0.617   negative
救濟金 (relief fund)                            0    0.614   negative
內需 (domestic demand)                          0    0.611   negative
買賣 (trading)                                  0    0.598   positive
空缺 (vacancy)                                  0    0.595   negative

Features Selected under lag-012

wk                                             xk   rep     pos./neg.
獎助金 (grant)                                  1    0.752   negative
促銷價 (discount)                               2    0.746   negative
民政處 (Civil Affairs Department)               2    0.739   negative
民政處 (Civil Affairs Department)               1    0.736   negative
獎助金 (grant)                                  2    0.734   negative
買點 (buy in)                                   1    0.731   negative
詳情 (inside information)                       1    0.730   negative
合作案 (case)                                   1    0.725   negative
貸款 (loan)                                     1    0.724   negative
跡象 (sign)                                     1    0.723   negative
Conclusions
• We have presented a novel business cycle surveillance system that utilizes the query logs of search engines to predict the status of a business cycle.
• We have developed a feature selection method based on the correlation between query frequencies and business cycles.
• One advantage of the proposed business surveillance system is that query logs are readily available through online Web services;
– hence, the system can provide timely business cycle information to reduce the uncertainty inherent in the business policy decision-making process.