Spark Tutorial for Text Analysis
Sunnie Ching
CIS612 Big Data and Parallel Data Processing
Aspect Based Opinion Mining of User-Product Reviews:
The data set used for this experiment was taken from the following website:
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
We downloaded the additional customer review datasets used by Ding, Liu and Yu, WSDM-2008.
This contained a dataset of roughly 1000 iPod reviews. The file was annotated with brackets and hash
signs for NLP processing. These were manually removed.
Using a semi-supervised approach, a set of important product features was fed into the model before
applying the LDA clustering technique to get actual term weights. To collect these features, a Python
script, scrap.py, was written to scrape the Classic iPod review page at Amazon.com.
The script takes the URL:
https://www.amazon.com/Apple-MC297LL-Generation-Discontinued-Manufacturer/dp/B001F7AHOG/ref=sr_1_1?s=mp3&ie=UTF8&qid=1492750330&sr=1-1
and creates three variables, one for each area of the Amazon.com iPod review page where product
features are located, scraping them by their XPath locations.
Finally, a CSV file containing the rows of scraped product features is created in the directory of the
Python script.
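The scraping step described above can be sketched roughly as follows. This is not the original scrap.py, which is not reproduced in the text. The page snippet, the element ids, the three XPath expressions and the function names are all assumptions made for illustration, and the standard-library ElementTree parser (which supports only a subset of XPath and needs well-formed markup) stands in for whatever HTML library the script actually used.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the product page.
# This is NOT Amazon's real markup; the ids below are invented.
PAGE = """
<html><body>
  <span id="productTitle">Apple iPod classic 160 GB</span>
  <div id="feature-bullets"><ul>
    <li>Long battery life</li>
    <li>Clear sound</li>
  </ul></div>
  <table id="techDetails">
    <tr><td>Storage</td></tr>
  </table>
</body></html>
"""

def scrape_features(page_source):
    """Collect feature strings from three page areas located by XPath."""
    root = ET.fromstring(page_source.strip())
    # Three variables, one per page area (the XPaths are assumptions).
    title = [e.text for e in root.findall(".//span[@id='productTitle']")]
    bullets = [e.text for e in root.findall(".//div[@id='feature-bullets']/ul/li")]
    details = [e.text for e in root.findall(".//table[@id='techDetails']/tr/td")]
    # Drop empty cells and surrounding whitespace.
    return [t.strip() for t in title + bullets + details if t and t.strip()]

def write_csv(rows, handle):
    """Write one scraped feature per CSV row, under a header row."""
    writer = csv.writer(handle)
    writer.writerow(["feature"])
    for row in rows:
        writer.writerow([row])

features = scrape_features(PAGE)
# An in-memory buffer keeps the sketch side-effect free; replace it with
# open("features.csv", "w", newline="") to write next to the script.
buffer = io.StringIO()
write_csv(features, buffer)
```

A real product page is rarely well-formed XML, so a tolerant HTML parser would be needed in practice; the structure (locate by XPath, collect, write rows) stays the same.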
We manually extracted 7 individual features, based on words that matched in the IpodReviews.txt file
downloaded from Professor Bing Liu’s website.
Keeping with the semi-supervised method, the data set of reviews was initially filtered using the
feature words obtained by this scraping method. In Apache Spark, the methodology is as follows: the
RDD reviews is filtered by comparing each review to each product aspect feature in the list. If a
review contains the aspect term, it is kept for clustering; if not, the review is removed from the
data set.
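The filtering rule can be sketched in plain Python as below. The aspect list is illustrative only, since the text does not list the 7 extracted terms, and the names ASPECTS and mentions_aspect are hypothetical. The same predicate could be passed to an RDD's filter method in PySpark, as noted in the final comment.

```python
import re

# Illustrative aspect terms; the 7 terms actually used are not listed in the text.
ASPECTS = ["battery", "case", "sound"]

def mentions_aspect(review, aspects=ASPECTS):
    """Return True if the review contains at least one aspect term as a whole word."""
    tokens = set(re.findall(r"[a-z0-9]+", review.lower()))
    return any(aspect in tokens for aspect in aspects)

reviews = [
    "The battery lasts all day.",
    "Shipping was slow.",
    "Great SOUND quality!",
]

# Keep only the reviews that mention an aspect term.
kept = [r for r in reviews if mentions_aspect(r)]

# With PySpark, assuming reviews_rdd is an RDD of review strings:
#   kept_rdd = reviews_rdd.filter(mentions_aspect)
```

Matching on whole tokens rather than substrings avoids keeping a review only because an aspect term appears inside a longer word.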
The reviews are then preprocessed once more to remove non-alphanumeric characters, preparing them
for sentiment analysis on important term features.
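A minimal sketch of this cleaning step follows, assuming that "non-alphanumeric" means anything other than ASCII letters, digits and whitespace. The function name clean_review is hypothetical.

```python
import re

def clean_review(text):
    """Replace non-alphanumeric characters with spaces, then collapse whitespace."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_review("Battery life: great!! (8/10)")
# cleaned == "Battery life great 8 10"
```

Replacing punctuation with a space rather than deleting it keeps words such as "8/10" from fusing into a single token.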
Output:
k=1 top features: battery, case, sound, features
k=2 top features: battery, sound