The Wisdom of Crowds and Its Applications to Business Intelligence

Chien Chin Chen
Department of Information Management
National Taiwan University


Outline

• Cold Start Recommendation

• Trend Prediction Using the Query Logs of Search Engines

An Effective Cold Start Recommendation Method Using Trust and Distrust Networks

Introduction

• With the rapid development of the World Wide Web, people can now obtain and share knowledge easily through a variety of online publishing tools, such as weblogs and online forums.

• The Web has thus become a valuable and abundant information source that has a significant effect on users’ lifestyles, especially their purchasing behavior (Vosen and Schmidt, 2011).

S. Vosen, T. Schmidt, Forecasting private consumption: survey-based indicators vs. Google trends, Journal of Forecasting, (2011) in press.

Introduction (cont.)

• To help users acquire information about desired items efficiently and motivate them to purchase, a great deal of e-commerce research has been devoted to improving recommendation systems.

• Collaborative filtering (CF) is one of the most popular recommendation techniques.

– Like-minded people prefer similar items!

– CF recommends items by aggregating the historical item preferences (i.e., item ratings) of reference users stored in a database.

– Because collaborative filtering is effective, it is utilized by many e-commerce websites, such as Amazon.com.

Introduction (cont.)

• Luarn and Lin (2003) posited that user loyalty is the key to the success of e-services and e-commerce systems.

– In the case of new users, providing them with useful information usually creates a sense of belonging and loyalty, and encourages them to visit e-commerce systems frequently.

• For B2C services, it is therefore essential that new users are provided with appropriate item recommendations.

P. Luarn, H.H. Lin, A Customer Loyalty Model for E-Service Context, Journal of Electronic Commerce Research, 4 (2003) 156-167.

Introduction (cont.)

• However, recommending items to new users is a difficult task because of the cold start phenomenon.

– New users require time to become familiar with recommendation systems.

– The systems usually have limited profiles of such users.

– Consequently, it is difficult for CF to identify effective reference users and make useful recommendations to new users.

Introduction (cont.)

• Most e-commerce systems allow users to establish social relationships.

– At Epinions.com, for example, users can create trust and distrust lists.

• Generally, people believe the users on their trust lists (or users trusted by their trusted users).

• In this research, we utilize the web of trust/distrust to identify trustworthy users.

– The identified experts are exploited as reference users to make useful recommendations for cold start users.

The Cold Start Recommendation Method

• The method consists of two stages:

– The model construction stage

– The recommendation stage

User Model Construction

• We partition the experienced users into clusters so that users in the same cluster have similar item preferences.

• Let C = {c1, c2, …, cK} denote a set of user clusters.

– A cluster ck comprises users with similar item ratings.

– We employ the K-means algorithm to cluster experienced users.

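Similarity between user un and cluster ck:

$$\mathrm{sim}(u_n, c_k) = \frac{\sum_{m:\, u_{n,m} \neq 0,\; c_{k,m} \neq 0} (u_{n,m} - \bar{u}_n)(c_{k,m} - \bar{c}_k)}{\sqrt{\sum_{m:\, u_{n,m} \neq 0,\; c_{k,m} \neq 0} (u_{n,m} - \bar{u}_n)^2}\;\sqrt{\sum_{m:\, u_{n,m} \neq 0,\; c_{k,m} \neq 0} (c_{k,m} - \bar{c}_k)^2}}$$

Vector representation of cluster ck:

$$c_k = \frac{1}{|c_k|} \sum_{u_n \in c_k} u_n$$

Here un,m and ck,m are the ratings of item im in the user vector and the cluster vector, and ūn and c̄k are the corresponding mean ratings.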
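The user-to-cluster similarity and the cluster vector can be illustrated with a short Python sketch. It is not the authors' implementation: it assumes ratings are stored as equal-length lists with 0 marking an un-rated item, that the similarity is a Pearson-style correlation restricted to items that are non-zero in both vectors, and that the means are taken over those shared items. The names `centroid` and `similarity` are hypothetical.

```python
from math import sqrt


def centroid(cluster):
    """Cluster vector: the element-wise mean of the member rating vectors."""
    size = len(cluster)
    return [sum(column) / size for column in zip(*cluster)]


def similarity(user, center):
    """Pearson-style similarity over items that are non-zero in both vectors."""
    shared = [m for m, (u, c) in enumerate(zip(user, center)) if u != 0 and c != 0]
    if not shared:
        return 0.0
    # Assumption: means are computed over the shared items only.
    u_mean = sum(user[m] for m in shared) / len(shared)
    c_mean = sum(center[m] for m in shared) / len(shared)
    num = sum((user[m] - u_mean) * (center[m] - c_mean) for m in shared)
    den = sqrt(sum((user[m] - u_mean) ** 2 for m in shared)) * sqrt(
        sum((center[m] - c_mean) ** 2 for m in shared)
    )
    return num / den if den else 0.0


# Toy data: two experienced users form one cluster; a third user is compared to it.
cluster_vector = centroid([[5, 3, 0], [3, 1, 0]])
score = similarity([4, 2, 5], cluster_vector)
```

In a K-means loop, each experienced user would be assigned to the cluster with the highest similarity and the cluster vectors recomputed until the assignments stop changing.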

Identify Cluster (Domain) Experts

• We need to measure each user’s trust score in a cluster.

• The web of trust of cluster ck is a directed graph Gk = (ck, ek):

– The users in ck form the set of nodes.

– ek = {(ui, uj)} is a set of directed edges.

• A directed edge (ui, uj) specifies that user uj is trusted by user ui.

[Figure: Web of Trust. A’s trust list (D, G, K, …) drawn as directed edges from node A to nodes D, G, and K.]

Identify Cluster (Domain) Experts (cont.)

• In a web of trust, users trusted by a large number of trusted users are regarded as cluster experts.

– The concept can be expressed by the following equation:

$$r_{k,j} = d \cdot \frac{1}{|c_k|} + (1 - d) \sum_{(u_i, u_j) \in e_k} \frac{r_{k,i}}{\deg(u_i)}$$

– Here rk,j is the trust score of user uj in ck, deg(ui) is the degree of ui in Gk, and d is a damping factor.

– The equation is identical to the well-known PageRank algorithm (Page et al., 1998).

• It is a recursive function: the trust score of a user depends on the scores of the users who trust that user.

• Problems:

– How do we initialize the trust score of a user?

– Will the recursion converge to a unique rk?

L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, Technical Report, Stanford, 1998.

Identify Cluster (Domain) Experts (cont.)

• Erkan and Radev (2004) showed that rk will converge to a unique stationary distribution regardless of the choice of initialized vector.

• After rk has converged, rk,j indicates the trust score of user uj in ck.

– The score is a non-negative number.

G. Erkan, D.R. Radev, LexRank: graph-based lexical centrality as salience in text summarization, J. Artif. Int. Res., 22 (2004) 457-479.
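The recursive trust score can be computed by simple iteration from a uniform starting vector. The Python sketch below is an illustration under stated assumptions rather than the authors' code: the damping value `d=0.15`, the convergence tolerance, the use of the out-degree (length of the trust list) as the degree, and the uniform redistribution of score from users with empty trust lists are all choices made here, and `trust_scores` is a hypothetical name.

```python
def trust_scores(users, edges, d=0.15, tol=1e-10, max_iter=1000):
    """PageRank-style trust scores.

    edges is a set of (i, j) pairs meaning "user j is trusted by user i".
    """
    n = len(users)
    trusted_by = {u: [j for (i, j) in edges if i == u] for u in users}
    scores = {u: 1.0 / n for u in users}  # uniform initial vector
    for _ in range(max_iter):
        updated = {u: d / n for u in users}
        for i in users:
            targets = trusted_by[i]
            if not targets:
                # Assumption: a user with an empty trust list spreads
                # his or her score uniformly over all users.
                targets = users
            share = (1 - d) * scores[i] / len(targets)
            for j in targets:
                updated[j] += share
        change = sum(abs(updated[u] - scores[u]) for u in users)
        scores = updated
        if change < tol:
            break
    return scores


# Toy web of trust: A trusts D, G, and K; D and K both trust G.
users = ["A", "D", "G", "K"]
edges = {("A", "D"), ("A", "G"), ("A", "K"), ("D", "G"), ("K", "G")}
scores = trust_scores(users, edges)
```

With this handling of empty trust lists the scores remain a probability distribution at every step, which matches the statement that the converged vector is a stationary distribution.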

Identify Cluster (Domain) Experts (cont.)

• Similarly, we construct the web of distrust of a cluster ck.

– We apply the above algorithm to it to calculate the distrust score xk,j of each user in the cluster.

• Then, we refine the trust vector rk by subtracting xk from it.

– As rk and xk are stationary distributions, the refined trust scores are within the range [-1, 1].

– The values indicate the trust degree of the users in a cluster.

• We rank the users based on their refined trust scores, and select the top-ranked users to compile an expert list expk of cluster ck.
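A minimal Python sketch of the refinement and expert selection follows. It assumes the trust and distrust scores are already available as dictionaries keyed by user; the function name `expert_list` and the alphabetical tie-break are illustrative choices, not part of the original method.

```python
def expert_list(trust, distrust, size):
    """Refine trust scores by subtracting distrust, then keep the top users."""
    refined = {u: trust[u] - distrust.get(u, 0.0) for u in trust}
    # Sort by refined score, highest first; ties are broken by user name.
    ranked = sorted(refined, key=lambda u: (-refined[u], u))
    return ranked[:size], refined


# Toy scores for one cluster (each vector sums to 1).
trust = {"A": 0.1, "D": 0.2, "G": 0.5, "K": 0.2}
distrust = {"A": 0.1, "D": 0.6, "G": 0.1, "K": 0.2}
experts, refined = expert_list(trust, distrust, size=2)
```

In this toy cluster, D is widely distrusted and ends up with a negative refined score, so D is never selected as an expert even though D's raw trust score equals K's.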

Implicit Links Identification

• We observed that many trust lists are very short.

– Users often have privacy concerns.

– Or users are simply lazy.

• Even if users do not compile trust lists, their activities reveal their trust orientations.

– We propose an Implicit Link Identification (ILI) algorithm to identify the missing links in a web of trust.

Implicit Links Identification (cont.)

• Specifically, we construct an implicit link from ui to uj if ui frequently gives a high rating to the items recommended by uj.
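The rule can be sketched in Python as a simple counting pass. The slides do not say how "frequently" and "high rating" are quantified, so the thresholds `high=4` and `min_count=3` below are purely hypothetical, as are the function name `implicit_links` and the `(rater, author, rating)` record layout.

```python
def implicit_links(review_ratings, high=4, min_count=3):
    """Return implicit trust edges (rater, author).

    review_ratings is an iterable of (rater, author, rating) records, where
    `author` wrote the review that `rater` rated.
    """
    counts = {}
    for rater, author, rating in review_ratings:
        if rater != author and rating >= high:
            counts[(rater, author)] = counts.get((rater, author), 0) + 1
    # Keep only pairs with enough high ratings (hypothetical threshold).
    return {pair for pair, count in counts.items() if count >= min_count}


records = [
    ("u1", "u2", 5), ("u1", "u2", 4), ("u1", "u2", 5),  # u1 often rates u2 highly
    ("u1", "u3", 5), ("u1", "u3", 2),                   # only one high rating
    ("u3", "u2", 1), ("u3", "u2", 2), ("u3", "u2", 1),  # frequent but low ratings
]
links = implicit_links(records)
```

The resulting edges would be added to the explicit trust lists before the trust scores are computed.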

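Cold Start Recommendation

• We predict the possible rating ûn,m of an un-rated item im for a cold start user un:

$$\hat{u}_{n,m} = \alpha \left( \frac{\sum_{u_j \in exp_{R(i_m)}} r_{R(i_m),j}\, u_{j,m}}{\sum_{u_j \in exp_{R(i_m)}} r_{R(i_m),j}} \right) + (1 - \alpha) \left( \frac{\sum_{1 \le y \le M;\; u_{n,y} \neq 0}\ \sum_{u_j \in exp_{R(i_y)}} r_{R(i_y),j}\, u_{j,m}}{\sum_{1 \le y \le M;\; u_{n,y} \neq 0}\ \sum_{u_j \in exp_{R(i_y)}} r_{R(i_y),j}} \right)$$

– The first term is the recommendation made by the experts of im.

– The second term is the recommendation made by the experts of the user profile, i.e., the experts of the items that un has already rated.

• R(ij) identifies the cluster that is most closely related to item ij:

$$R(i_j) = \arg\max_{1 \le k \le K} \sum_{u_y \in c_k} \mathrm{sign}(u_{y,j})$$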
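The two recommendation sources can be illustrated with a Python sketch. It is a simplified reading, not the authors' implementation: it assumes ratings are stored as nested dictionaries, that the cluster related to an item is the one containing the most users who rated it, that each source is a trust-weighted average over experts who actually rated the target item, and that the two sources are blended by a weight `alpha` with a fallback when one source is empty. All function names are hypothetical.

```python
def related_cluster(item, clusters, ratings):
    """Index of the cluster with the most members who rated the item."""
    counts = [sum(1 for u in members if item in ratings.get(u, {}))
              for members in clusters]
    return max(range(len(clusters)), key=lambda k: counts[k])


def expert_average(item, cluster_ids, experts, trust, ratings):
    """Trust-weighted average rating of `item` by experts of the given clusters."""
    num = den = 0.0
    for k in cluster_ids:
        for e in experts[k]:
            if item in ratings.get(e, {}):
                num += trust[k][e] * ratings[e][item]
                den += trust[k][e]
    return num / den if den > 0 else None


def predict_rating(user, item, clusters, experts, trust, ratings, alpha=0.5):
    # Source 1: experts of the cluster most related to the target item.
    by_item = expert_average(
        item, [related_cluster(item, clusters, ratings)], experts, trust, ratings)
    # Source 2: experts of the clusters related to the items in the user's profile.
    profile = [related_cluster(i, clusters, ratings)
               for i in ratings.get(user, {}) if i != item]
    by_profile = expert_average(item, profile, experts, trust, ratings)
    if by_item is None or by_profile is None:
        # Assumption: fall back to whichever source is available (or None).
        return by_item if by_profile is None else by_profile
    return alpha * by_item + (1 - alpha) * by_profile


ratings = {"e1": {"i1": 5}, "m1": {"i1": 4}, "e2": {"i1": 3, "i2": 4}, "u": {"i2": 4}}
clusters = [["e1", "m1"], ["e2"]]
experts = [["e1"], ["e2"]]
trust = [{"e1": 0.6}, {"e2": 0.4}]
guess = predict_rating("u", "i1", clusters, experts, trust, ratings)
```

Returning `None` when no expert has rated the item makes it possible to measure how often a prediction can be produced at all.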

Experiment

• We used the Epinions dataset to evaluate the performance of the proposed system.

Statistics of the Epinions Dataset

# of experienced users: 49,413
# of cold start users: 71,073
# of reviews: 1,560,144
# of review ratings: 13,668,319
# of trust relations: 717,667
# of distrust relations: 120,718
Avg. trust list length (experienced users): 12.214
Avg. trust list length (cold start users): 0.679

Experiment (cont.)

• We use the conventional leave-one-out procedure.

– For each cold start user, we evaluate the recommendation performance over multiple runs.

– Each evaluation run treats one rated review as an un-rated item and predicts a rating for it by using the information about the remaining rated reviews.

• The evaluation metrics are the coverage rate, the mean absolute error (MAE), and the execution time.
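The leave-one-out loop and two of the metrics can be sketched in Python as follows. The sketch assumes that coverage is the fraction of held-out ratings for which the recommender can produce a prediction, and that MAE is averaged over those covered cases only; the `predictor` callback and the function name `leave_one_out` are hypothetical stand-ins for any recommendation method.

```python
def leave_one_out(user_ratings, predictor):
    """Evaluate one user: hide each rating in turn and try to predict it.

    predictor(remaining, item) returns a rating or None when it cannot predict.
    Returns (mae, coverage); mae is None if nothing could be predicted.
    """
    errors = []
    runs = 0
    for held_out, actual in user_ratings.items():
        remaining = {i: r for i, r in user_ratings.items() if i != held_out}
        runs += 1
        guess = predictor(remaining, held_out)
        if guess is not None:
            errors.append(abs(guess - actual))
    coverage = len(errors) / runs if runs else 0.0
    mae = sum(errors) / len(errors) if errors else None
    return mae, coverage


def mean_predictor(remaining, item):
    """Toy recommender: predict the mean of the user's other ratings."""
    return sum(remaining.values()) / len(remaining) if remaining else None


mae, coverage = leave_one_out({"a": 4, "b": 2, "c": 3}, mean_predictor)
```

Execution time, the third metric, would be measured around the `predictor` call.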

The Effect of the Number of Clusters

The Effect of the Web of Distrust and ILI

The Effect of the Number of Experts

Comparison with Other Recommendation Methods

MAE Coverage Rate Execution Time (ms.)

Collaborative Filtering 0.68 3.20% 83.3

MoleTrust‐0 0.69 5.78% 13.5

MoleTrustMaven‐0 0.69 6.03% 21.9

MoleTrustFreq‐0 0.69 6.31% 22.0

MoleTrustConn‐0 0.69 6.14% 21.9

TidalTrust 0.75 34.68% 5873.1

Our method (|expk|=1) 0.71 19.95% 15.4

MAE Coverage Rate Execution Time (ms.)

MoleTrust‐2 0.75 26.63% 1145.3

MoleTrustMaven‐2 0.75 30.90% 2556.3

MoleTrustFreq‐2 0.75 31.16% 2202.3

MoleTrustConn‐2 0.75 31.23% 2846.2

TidalTrust 0.75 34.68% 5873.1

Our method (|expk|=4) 0.73 36.35% 16.9

Conclusions

• Research on cold start recommendations is important because retaining new users is the key to the success of e-services and e-commerce systems.

• We have proposed a cold start recommendation method that analyzes the social relationships of e-commerce users in order to identify trustworthy experts and derive useful recommendations for new users.

References

• Chien Chin Chen and Yu-Hao Wan, "An Effective Cold Start Recommendation Method Using Trust and Distrust Networks," under revision at Information Sciences.

• Yu-Hao Wan and Chien Chin Chen, "An Effective Cold Start Recommendation Method Using a Web of Trust," in Proceedings of the 15th Pacific Asia Conference on Information Systems (PACIS 2011), paper 205, 2011.

Business Cycle Surveillance Using the Query Logs of Search Engines

Introduction

• Web search engines have become important tools that enable people to acquire various types of information.

• Baeza-Yates and Tiberi (2007) asserted that, if people’s search activities on the Web are deliberate, the query logs of search engines are valuable sources of information for analyzing social activities.

– In other words, the query logs are effective sensors of social activities.

R. Baeza-Yates, A. Tiberi, Extracting semantic relations from query logs, Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, USA, 2007, pp. 76-85.

Introduction (cont.)

• In the past, query logs were the private property of search engine companies …

• Now, it is possible to obtain the latest query information through a number of online Web services, for example, Google Insights.

– http://www.google.com/insights/search

• Broadly speaking, query logs can be considered a kind of Web 2.0 data.

– Internet users can access query information freely, and they contribute their opinions when performing Web searches.

Introduction (cont.)

• Ginsberg et al. (2009) employed query logs for efficient influenza surveillance. The proposed linear model outperforms other methods used to predict influenza outbreaks.

• Since then, many studies have used query logs to predict unemployment rates, consumption, retail sales, vehicle sales, real estate sales, and travel package sales.

J. Ginsberg, M.H. Mohebbi, R.S. Patel, L. Brammer, M.S. Smolinski, L. Brilliant, Detecting influenza epidemics using search engine query data, Nature 457 (2009) 1012–1014.

Introduction (cont.)

• However … most approaches rely on experts to compile a set of query terms.

• Can we identify representative query terms automatically?

• In this work, we utilize the query logs of search engines for business cycle modeling.

– Instead of consulting domain experts, we propose a feature (term) selection method to identify representative query terms from a large set of business articles.

Monitor Indicator

• The Council for Economic Planning and Development (CEPD) maintains an indicator, called the Monitor Indicator, which indicates the monthly status of Taiwan’s business cycle.

– Five status classes: red light (紅燈), yellow-red light (黃紅燈), green light (綠燈), yellow-blue light (黃藍燈), and blue light (藍燈).

– There is a one-month lag in the release of all the business indices and indicators compiled by the CEPD.

– The time lag increases the uncertainty that governments and industries face when establishing appropriate business policies.

• We define business cycle surveillance as a supervised classification problem.

– That is, given the query frequencies of representative query terms, we need to predict the status class of the business cycle.

The Business Cycle Surveillance System

Feature Selection

• For each extracted noun wk, we treat its query frequencies Fk = {fk,1, fk,2, …, fk,n} as a time series.

– fk,j is the query frequency of wk downloaded from Google Insights at time (month) j.

– It is an integer in the range [0, 100].

• Let M = {m1, m2, …, mn} be a set of Monitor Indicator values published by the CEPD.

– mj is the Monitor Indicator value at time (month) j.

– Its range is [9, 45].

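Feature Selection (cont.)

• For each wk, we treat the pair <wk,xk> as a surveillance feature.

– The variable xk is a non-negative integer that determines the length of the lag in wk’s query frequencies.

• We calculate the representativeness of a feature <wk,xk> by the following equation:

$$rep(\langle w_k, x_k \rangle) = \left| corr(F_k[1, n - x_k],\, M[x_k + 1, n]) \right| = \left| \frac{\sum_{j = x_k + 1}^{n} \left(f_{k,j - x_k} - \overline{F_k[1, n - x_k]}\right)\left(m_j - \overline{M[x_k + 1, n]}\right)}{\sqrt{\sum_{j = x_k + 1}^{n} \left(f_{k,j - x_k} - \overline{F_k[1, n - x_k]}\right)^2}\;\sqrt{\sum_{j = x_k + 1}^{n} \left(m_j - \overline{M[x_k + 1, n]}\right)^2}} \right|$$

– The overlined terms are the means of the two sub-series.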

Feature Selection (cont.)

• If corr(Fk[1,n–xk],M[xk+1,n]) > 0:

– The surveillance feature <wk,xk> and the business cycle are positively correlated.

– Users tend to query wk xk months before the expansion of a business cycle.

• What about corr(Fk[1,n–xk],M[xk+1,n]) < 0? The feature and the business cycle are then negatively correlated, so users tend to query wk xk months before a contraction.

• Both positively and negatively correlated features are useful for constructing a business status classifier, so rep(<wk,xk>) outputs the absolute correlation coefficient.

• We select the top K representative terms to construct a business status classifier for comparison.
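The lagged correlation and the top-K selection can be illustrated with a small Python sketch. It assumes both series have the same length n and are plain lists, that a constant series gets a score of 0, and that candidate lags are supplied by the caller; the names `rep` and `top_features` and the toy numbers are hypothetical.

```python
from math import sqrt


def rep(freqs, monitor, lag):
    """Absolute Pearson correlation between F[1, n-lag] and M[lag+1, n]."""
    n = len(monitor)
    xs = freqs[: n - lag]   # query frequencies, shifted back by `lag` months
    ys = monitor[lag:n]     # Monitor Indicator values they are compared with
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sqrt(sum((x - x_mean) ** 2 for x in xs)) * sqrt(
        sum((y - y_mean) ** 2 for y in ys)
    )
    return abs(num / den) if den else 0.0


def top_features(series, monitor, lags, k):
    """Rank every <term, lag> pair by rep and keep the best k."""
    scored = [(rep(freqs, monitor, lag), term, lag)
              for term, freqs in series.items() for lag in lags]
    scored.sort(key=lambda s: (-s[0], s[1], s[2]))
    return [(term, lag, score) for score, term, lag in scored[:k]]


# Toy data: "loan" queries rise one month before the indicator falls.
monitor = [9, 30, 28, 26, 24]
series = {"loan": [10, 20, 30, 40, 50], "flat": [5, 5, 5, 5, 5]}
best = top_features(series, monitor, lags=[0, 1], k=1)
```

Because the score is an absolute value, the negatively correlated pair <loan, 1> ranks first in this toy example.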

Classification Model

• We adopt the Naive Bayes model to construct our business cycle surveillance system.

• Naive Bayes predicts the most likely business cycle status as follows:

$$\hat{c}_t = \arg\max_{1 \le l \le L} P(c_l \mid O_t) = \arg\max_{1 \le l \le L} P(c_l)\, P(O_t \mid c_l) = \arg\max_{1 \le l \le L} P(c_l) \prod_{k=1}^{K} P(f_{k,t - x_k} \mid c_l)$$

– Ot = <f1,t–x1, f2,t–x2, …, fK,t–xK> is the frequency observation vector.

– P(cl) and P(fk,t–xk | cl) can be acquired from a training dataset.
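A compact Python sketch of such a classifier is shown below. It assumes the query frequencies have already been discretized into bin indices, works in log space to avoid numerical underflow, and applies add-one smoothing to the conditional probabilities, which is an assumption of this sketch and is not stated in the slides. The names `train_naive_bayes` and `classify` are hypothetical.

```python
from collections import Counter, defaultdict
from math import log


def train_naive_bayes(observations, labels, n_bins):
    """Count class priors and per-feature bin frequencies from training data."""
    prior = Counter(labels)
    cond = defaultdict(Counter)  # (class, feature index) -> bin counts
    for obs, label in zip(observations, labels):
        for k, bin_index in enumerate(obs):
            cond[(label, k)][bin_index] += 1
    return prior, cond, n_bins


def classify(model, obs):
    """Return the class maximizing P(c) * prod_k P(f_k | c)."""
    prior, cond, n_bins = model
    total = sum(prior.values())

    def log_score(label):
        score = log(prior[label] / total)
        for k, bin_index in enumerate(obs):
            # Add-one smoothing (assumption) keeps unseen bins from zeroing the product.
            score += log((cond[(label, k)][bin_index] + 1) / (prior[label] + n_bins))
        return score

    return max(prior, key=log_score)


# Toy training set: two discretized query-term features per month.
observations = [(0, 2), (0, 1), (2, 0), (2, 1)]
labels = ["blue", "blue", "red", "red"]
model = train_naive_bayes(observations, labels, n_bins=3)
status = classify(model, (0, 2))
```

Each observation tuple plays the role of the frequency observation vector, with one discretized value per selected feature.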

Data Sparseness Problem

• Unfortunately, the Naive Bayes classification is affected by the sparseness of query frequencies.

• We employ two popular data discretization methods, namely, the equal-width method (Han and Kamber, 2006) and the ChiMerge method (Kerber, 1992).
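Of the two methods, equal-width discretization is simple enough to sketch in a few lines of Python. The range [0, 100] matches the query frequency range stated earlier; the number of bins (5) and the function name `equal_width` are hypothetical choices for illustration. ChiMerge, which merges adjacent intervals based on a chi-square test, is not shown.

```python
def equal_width(value, low=0, high=100, bins=5):
    """Map a value in [low, high] to one of `bins` equally wide intervals."""
    width = (high - low) / bins
    index = int((value - low) / width)
    # Clamp so that value == high falls into the last bin.
    return min(max(index, 0), bins - 1)


# Discretize one term's monthly query frequencies.
frequencies = [0, 19, 20, 55, 100]
binned = [equal_width(f) for f in frequencies]
```

Grouping raw frequencies into a handful of bins means each bin is observed far more often in the training data than any single raw value would be.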

Experiment

• We collected 60 of the CEPD’s Monitor Indicator status classes for a 5-year period (2005 to 2009) and used them to construct and test the system.

• We downloaded 11,228 business-related articles published between 2005 and 2009 by the Liberty Times.

• To derive credible evaluation results, we utilize the leave-one-out cross validation method.

Number of evaluated business statuses: 60
Number of news articles: 11,228
Number of extracted terms: 58,689
Number of candidate terms: 6,224
Number of query frequencies: 3,521,340

Experiment (cont.)

The effect of K, the number of selected terms:

• lag-012: the setting includes the frequency lags for a three-month period (i.e., a quarter).

• The precision rate decreases as the number of selected features (i.e., K) increases.

• lag-012 is superior to lag-0 when K > 10.

• Many of the features selected under lag-012 are based on the same term and differ only in the length of the lag.

Experiment (cont.)

The effect of discretization:

• Frequency discretization is very important for constructing a reliable business status classifier.

• It avoids the zero probability problem.

Experiment (cont.)

Comparisons with other feature selection methods:

• The terms selected by Freq may be too common to be used in discriminating the status of business cycles.

• The terms selected by PMI-IR are strongly associated with business cycles.

• However, such terms are very specific, and they are rarely used to query business cycle information.

Query Term Analysis

Features Selected under lag-0

wk xk rep pos./neg.

獎助金 (grant) 0 0.693 negative

買點 (buy in) 0 0.666 negative

大華 (Grand Cathay Securities Corporation) 0 0.644 positive

現增 (capital increased by cash) 0 0.630 positive

護盤 (stabilization) 0 0.619 negative

詳情 (inside information) 0 0.617 negative

救濟金 (relief fund) 0 0.614 negative

內需 (domestic demand) 0 0.611 negative

買賣 (trading) 0 0.598 positive

空缺 (vacancy) 0 0.595 negative

Features Selected under lag-012

wk xk rep pos./neg.

獎助金 (grant) 1 0.752 negative

促銷價 (discount) 2 0.746 negative

民政處 (Civil Affairs Department) 2 0.739 negative

民政處 (Civil Affairs Department) 1 0.736 negative

獎助金 (grant) 2 0.734 negative

買點 (buy in) 1 0.731 negative

詳情 (inside information) 1 0.730 negative

合作案 (case) 1 0.725 negative

貸款 (loan) 1 0.724 negative

跡象 (sign) 1 0.723 negative

Conclusions

• We have presented a novel business cycle surveillance system that utilizes the query logs of search engines to predict the status of a business cycle.

• We have developed a feature selection method based on the correlation between query frequencies and business cycles.

• One advantage of the proposed business surveillance system is that query logs are readily available through online Web services;

– hence, the system can provide timely business cycle information to reduce the uncertainty inherent in the business policy decision-making process.

References

• Chien Chin Chen and Yi-Tian Tsai, "A Novel Business Cycle Surveillance System Using the Query Logs of Search Engines," Knowledge-Based Systems (KBS), Vol. 30, pp. 104-114, 2012.