
Yelp Fake Review Detection

Mo Zhou, Chen Wen, Dhruv Kuchhal, Duo Chen
Electrical Engineering Department

Columbia University
{mz2417, cw2758, dk2814, dc3026}@columbia.edu

Abstract— In this work we separate deceptive opinion spam from honest reviews using the dataset provided by Yelp. Current opinion spam detection is supervised and relies on heuristics; we propose an unsupervised learning algorithm instead. The Yelp dataset includes both ratings and text reviews of businesses. As a novel contribution, we construct our approach by looking for anomalies in both the rating and the review fields, based on the observation that a review whose rating deviates strongly from the other reviews of the same business is a potential spam.

Keywords - Fake reviews, opinion spam, rating similarity, cosine similarity, review similarity

I. INTRODUCTION

These days, with the boom of e-commerce, product and business reviews have become an essential input for consumers deciding where to shop, dine, travel and more. By reading the reviews and looking at the ratings on business aggregator websites like Yelp or e-commerce websites like Amazon or eBay, a consumer can make a well informed judgment about the quality of a product, its warranty, the services it offers, and so on. The honesty of these reviews therefore goes a long way toward an informed decision. Recognizing the value of reviews and ratings, some businesses employ people to write fake reviews about themselves or about a competing product or business in order to drive sales toward themselves and gain an unfair advantage by spreading false information and deliberately misleading consumers. This practice is called opinion spamming. As more and more businesses rely on online reviews to drive their sales, detecting these fake reviews becomes increasingly crucial.

In this paper we propose a novel method to find opinion spammers using Yelp business ratings and reviews. A unique challenge is the lack of ground truth: we do not have the resources to hire domain experts or to crowd-source on Amazon Mechanical Turk to establish reference genuine reviews as opposed to fake ones. Nevertheless, we introduce a system that can automatically spot potential opinion spammers by comparing each user's rating similarity and review similarity to the average over all other users, under the assumption that most ratings and reviews on Yelp.com come from valid sources; our results suggest this approximation is reasonable. A further motivation for this approach is that malicious users typically receive incentives to post positive reviews for specific businesses that suffer from low ratings or bad reviews, so the ratings and review contents from fake reviewers tend to deviate significantly from those of true customers.

The rest of the paper is organized as follows. Section II reviews related work and the background literature we surveyed before embarking on the project. Section III describes the detailed steps of our proposed system for fake review detection. Section IV provides details of our algorithm and the platform we use, and Section V introduces our open-sourced software package. Section VI discusses our experimental results and evaluates system performance. Finally, Section VII concludes the paper.

II. LITERATURE REVIEW

The problem of opinion spamming was first studied in 2008 [2] and interest in the topic has grown ever since. Several related problems have been studied, ranging from (but not limited to) individual opinion spammers to spamming groups [1]. Other studies analyse the timeline of reviews and draw inferences from it [3]. Two main approaches have been successful in the past for detecting fake reviews:

1) Use Amazon Mechanical Turk to collect fake reviews
2) Segregate the fake reviewers into identified groups

The first approach is used in [4]. It is easy to implement, but it cannot mimic the actual climate of opinion spamming: incentive-based fake reviews are different from crowd-sourced fake reviews, in which random people merely write easy-to-spot fakes. The second method, proposed in [1], showed in particular that spammer groups can be far more damaging to a business than an individual spammer, since they can take quasi control of the sentiment toward a particular product. It also shows that it is difficult to rely on content-based detection and behavior-based detection alone, because any group member can choose not to behave abnormally at any time, and it indicates that reviewer groups have a few traits in common, such as the commonality of the product and the time window in which it is reviewed.

Both of the above processes are expensive and laborious. Another novel unsupervised approach has been proposed that utilises Bayesian clustering to model the "spamicity" (degree of spam) of reviews [5]. It uses behavioral and linguistic modelling of the reviewers to build a Latent Spam Model (LSM) for identifying fake reviews: the behavioral differences of the reviewers create a margin between spam and non-spam which the LSM aims to learn and exploit to distinguish the fake reviews from the rest. It also provides insight into the linguistic characteristics of deceptive opinions.

[System flowchart: Start → Input → Data Preprocessing → Secondary Input → Main Algorithm → Output → Stop]

III. SYSTEM OVERVIEW

The dataset we downloaded from the Yelp Dataset Challenge includes data from Phoenix, Las Vegas, Madison, Waterloo and Edinburgh, with 42,153 businesses and 1,125,458 reviews. The size and richness of the data prompted us to devise a system that first cleans the data and then feeds the preprocessed data into the main algorithm. The raw data has the following format [6]:

• Business
'type': 'business', 'business_id': (encrypted business id), 'name': (business name), 'neighborhoods': [(hood names)], 'full_address': (localized address), 'city': (city), 'state': (state), 'latitude': latitude, 'longitude': longitude, 'stars': (star rating, rounded to half-stars), 'review_count': review count, 'categories': [(localized category names)], 'open': True / False (corresponds to closed, not business hours), 'hours': {(day of week): {'open': (HH:MM), 'close': (HH:MM)}, ...}, 'attributes': {(attribute name): (attribute value), ...}

• Review
'type': 'review', 'business_id': (encrypted business id), 'user_id': (encrypted user id), 'stars': (star rating, rounded to half-stars), 'text': (review text), 'date': (date, formatted like '2012-03-14'), 'votes': {(vote type): (count)}

• User
'type': 'user', 'user_id': (encrypted user id), 'name': (first name), 'review_count': (review count), 'average_stars': (floating point average, like 4.31), 'votes': {(vote type): (count)}, 'friends': [(friend user ids)], 'elite': [(years elite)], 'yelping_since': (date, formatted like '2012-03'), 'compliments': {(compliment type): (num compliments of this type), ...}, 'fans': (num fans)

A flowchart of how our system operates is shown above. Our process is as follows: first we load the raw Yelp data into Python and parse the rating and review fields corresponding to each unique user id. We then feed this cleaned data into our main algorithm, which is implemented in Java with Eclipse. The details of the algorithm are provided in the next section. The output of the main algorithm is then plotted and shown.

IV. ALGORITHM

A. Preprocessing

The preprocessing stage of our approach follows the steps below. The input is the raw dataset acquired from the Yelp Dataset Challenge, and the output of this stage serves as the secondary input to the main algorithm.

1) Create a dictionary from the business dataset, mapping each encrypted business id to the actual business name.

2) Go through all the rows in the review dataset and grab the review text, the rating and the corresponding (encrypted) business id.

3) Translate each business id by looking it up in the dictionary created in step one, and output the reviews and ratings under the actual business name. For the ratings, one txt file is created per business in which all ratings are listed with sorted numbers. For the reviews, one directory is created per business containing one txt file for each single review text, again with a sorted number attached.

Seemingly duplicate business names come from the dataset itself, since these similar businesses actually have completely different encrypted business ids. We therefore leave them untouched and keep the data in its original form.
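The following Python fragment is a minimal sketch of steps 1 to 3 above, shown only to illustrate the flow; the file names are assumptions based on the public Yelp Dataset Challenge release, the output layout is simplified to one ratings file per business, and the released extract_new.py remains the authoritative implementation.

    import json
    import os
    from collections import defaultdict

    # Step 1: map each encrypted business_id to its business name.
    # File names below are assumptions; adjust to the downloaded release.
    id_to_name = {}
    with open("yelp_academic_dataset_business.json", encoding="utf-8") as f:
        for line in f:
            business = json.loads(line)
            id_to_name[business["business_id"]] = business["name"]

    # Steps 2-3: collect ratings per business under the actual business name.
    ratings_by_business = defaultdict(list)
    with open("yelp_academic_dataset_review.json", encoding="utf-8") as f:
        for line in f:
            review = json.loads(line)
            name = id_to_name.get(review["business_id"], review["business_id"])
            ratings_by_business[name].append((review["user_id"], review["stars"]))

    # One numbered ratings txt file per business (review texts are written
    # analogously, one txt file per review).
    os.makedirs("ratings", exist_ok=True)
    for name, entries in ratings_by_business.items():
        safe_name = "".join(c if c.isalnum() else "_" for c in name)
        path = os.path.join("ratings", safe_name + ".txt")
        with open(path, "w", encoding="utf-8") as out:
            for number, (user_id, stars) in enumerate(entries, start=1):
                out.write("%d\t%s\t%s\n" % (number, user_id, stars))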

B. Main Algorithm

We first give the mathematical construction of our approach. As a novel proposal, the rating similarity takes into account the deviation of a user's rating from the others and normalizes it by the maximum rating. The review similarity adopts the standard cosine similarity.

Page 3: Yelp Fake Review Detection - · PDF fileYelp Fake Review Detection Mo Zhou, Chen Wen, Dhruv Kuchhal, Duo Chen Electrical Engineering Department Columbia University fmz2417, cw2758,

• Let $r_1, r_2, \ldots, r_N$ be the ratings given to the same business by its $N$ users, with a maximum rating of 5. Then

$$\mathrm{rSim}_n = 1 - \frac{1}{N}\sum_{\substack{i,k=0 \\ i \neq k}}^{N} \frac{r_i - r_k}{5} \;\in\; [0,1]$$

denotes the average rating similarity score between user $n$ and all other users, restricted to the same business.

• Let $c_1, c_2, \ldots, c_N$ be the vectorized reviews posted by the $N$ users of the same business. Then

$$\mathrm{cSim}_n = \frac{\sum_{\substack{i,k=0 \\ i \neq k}}^{N} c_i \times c_k}{\sqrt{\sum_{i=0}^{N} c_i^2}\;\sqrt{\sum_{k=0}^{N} c_k^2}} \;\in\; [0,1]$$

denotes the average review similarity score between user $n$ and all other users of the same business.

Both similarity scores lie between 0 and 1, with a larger score indicating higher similarity. We read several reviews with extremely low review similarity and found that they exhibit strong signs of opinion spamming. For example, one reviewer of a pizza diner, whose cSim of 0.028 is far below all other cSim_n values, talked mainly about how the place has the best beer; we find it unusual for the beer offering of a small pizza place to be described as 'Incredible' or 'Amazing', which are the words used in the text. This reinforced our belief that an extremely low average cosine similarity can be the footprint of a fake review. Nevertheless, detrimental opinion spamming does not stop there: one would also have to rate the business extremely high or low to serve the purpose. Hence we implemented rating similarity as the complementary part of our approach. Clustering the users with both extremely low cSim_n and extremely low rSim_n yields a list of fake reviews as well as the users who posted them.
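To make the two scores concrete, the Python sketch below computes rSim and cSim for the reviewers of a single business. The term-frequency vectorization, the absolute rating deviation and the per-user averaging are our reading of the expressions above, not a transcription of the released Java classes, which remain the reference implementation.

    import math
    from collections import Counter

    def tf_vector(text):
        # Simple lower-cased term-frequency vector for one review.
        return Counter(text.lower().split())

    def cosine(a, b):
        # Standard cosine similarity between two sparse term-frequency vectors.
        dot = sum(count * b[word] for word, count in a.items() if word in b)
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def similarity_scores(users, ratings, texts, max_rating=5.0):
        # Returns {user: (rSim, cSim)} for the reviewers of one business.
        # rSim averages the absolute rating deviation from every other user,
        # normalized by the maximum rating; cSim averages the cosine similarity
        # between this user's review and every other user's review.
        vectors = [tf_vector(t) for t in texts]
        scores = {}
        for n, user in enumerate(users):
            others = [k for k in range(len(users)) if k != n]
            if not others:  # a business with a single reviewer
                scores[user] = (1.0, 1.0)
                continue
            r_sim = 1.0 - sum(abs(ratings[n] - ratings[k]) / max_rating
                              for k in others) / len(others)
            c_sim = sum(cosine(vectors[n], vectors[k])
                        for k in others) / len(others)
            scores[user] = (r_sim, c_sim)
        return scores

For instance, similarity_scores(['u1', 'u2', 'u3'], [5, 1, 1], ['amazing incredible beer', 'thin crust pizza was good', 'good pizza crispy crust']) gives u1 a markedly lower rSim and a cSim of 0, mirroring the pizza-diner case discussed above.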

V. SOFTWARE PACKAGE DESCRIPTION

The preprocessing part is implemented in Python and provided in the extract_new.py package. Both parts of the main algorithm are implemented in Java with Eclipse. We wrote four Java classes. CosineSimilarity.java is the main class, where we take vectorized text reviews as input and implement the standard cosine similarity algorithm to find cSim_n between one specific user review and all other reviews for the same business. RatingSimilarity.java handles the first part of our main algorithm and is a straightforward implementation of the mathematical expression given in the previous section. GetDir.java facilitates the data extraction process, as we have over one million reviews and ratings as well as over 40,000 businesses. keyPair.java is another trick we implement to expedite the computation, since we need to match the rating similarity and the review similarity of the same user in the output. We write the output to a csv file and import it into Excel, and visualize the results using a 2D scatter plot. We keep records of the user, rating, review and business for each point, and generate a list of fake-review users.

Fig. 1. Data Preprocessing Example

Fig. 2. Main Algorithm Example

A screenshot of the data parsing in Python is provided in Figure 1. Screenshots of the Eclipse implementation are provided in Figure 2.

VI. EXPERIMENTAL RESULTS

An overview of our results is provided in Figure 3. Due to the size of the output file and the number of data points, we are unable to plot all one million users on the graph. Nonetheless, the overall distribution of the results is visible: the majority of users have a high rating similarity score and a decent review similarity score, which substantiates our hypothesis that most ratings and reviews are genuine. The deviants in the bottom-left corner are the users of interest. According to Yelp.com, it deploys its own spam filter, which may result in a lack of true negatives. Nevertheless, we still obtain several hundred potential fake-review candidates, as provided in Figure 4.

Figure 4 is the complete set of results we obtain: the potential opinion spammers out of over one million users. The thresholds for rSim and cSim are currently set to 0.2 and 0.3, respectively, which is a reasonable choice given the overall distribution in Figure 3. Part of our future work will be optimizing these cut-off values, which will require domain experts to establish a reference level of ground truth. For this project, we are satisfied with our results, as the anomalies in the ratings and reviews cannot be overlooked; these reviewers are very likely posting fake ratings and reviews for financial gain.
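The thresholding and visualization step can be sketched as follows, assuming the per-user scores computed earlier; we use matplotlib here purely for illustration, whereas the actual pipeline exports a csv file and plots it in Excel.

    import matplotlib.pyplot as plt

    R_THRESHOLD = 0.2  # rSim cut-off used in this paper
    C_THRESHOLD = 0.3  # cSim cut-off used in this paper

    def flag_spam_candidates(scores):
        # scores: {user: (rSim, cSim)}; flag users below both thresholds.
        return [user for user, (r_sim, c_sim) in scores.items()
                if r_sim < R_THRESHOLD and c_sim < C_THRESHOLD]

    def plot_scores(scores):
        # 2D scatter plot of rating similarity vs. review similarity;
        # spam candidates fall in the bottom-left corner.
        r_values = [r for r, _ in scores.values()]
        c_values = [c for _, c in scores.values()]
        plt.scatter(r_values, c_values, s=5)
        plt.xlabel("rating similarity (rSim)")
        plt.ylabel("review similarity (cSim)")
        plt.show()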

VII. CONCLUSION

In this paper, we propose a novel automatic procedure to flag opinion spammers on rating websites. Our system preprocesses the data and feeds the valuable fields into the main algorithm, which then compares each user's rating and review to the average of all other users of the same restaurant business.


Fig. 3. Result Overview

Fig. 4. Opinion Spam Candidates

Using the complete set of data from the Yelp Dataset Challenge, we are able to locate several hundred potential fake-review posters among one million active users.

The individual contributions to this project are as follows: Dhruv proposed the dataset we currently use and did the literature review of the topic; Duo was responsible for parsing the raw data and cleaning it up in Python; Mo and Chen implemented the main algorithm in Java using Eclipse.

In future work, we will incorporate more feature extraction into the spam signals, including the time window of the posted reviews and user relationships. As mentioned in the Literature Review section, opinion spammers are likely to work in a group to post positive reviews for certain restaurants, which would be revealed once we obtain their connections as well as the posting times. An innocuous user is highly unlikely to review a large number of restaurants within a few minutes.

VIII. ACKNOWLEDGMENT

The authors would like to thank Professor Ching-Yung Lin for providing a soon-to-be-published paper on the same topic, which was of great help during the course of our project.

REFERENCES

[1] A. Mukherjee et al., Spotting Fake Reviewer Groups in Consumer Reviews, WWW 2012, ACM 978-1-4503-1229-5/12/04.

[2] N. Jindal et al., Opinion Spam and Analysis, Proceedings of the International Conference on Web Search and Web Data Mining (WSDM), 2008.

[3] S. Xie et al., Review Spam Detection via Time Series Pattern Discovery, Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), 2012.

[4] M. Ott et al., Finding Deceptive Opinion Spam by Any Stretch of the Imagination, Association for Computational Linguistics (ACL), 2011.

[5] A. Mukherjee et al., Opinion Spam Detection: An Unsupervised Approach using Generative Models, University of Houston, 2014.

[6] Yelp Dataset Challenge, http://www.yelp.com/dataset_challenge