mark-wang
Supported Queries
Wordclouds for users of any group of subreddits, built from their comments everywhere.
Analysis of how often members of subreddits make comments in other subreddits.
Can filter by comment score, user activity level, specific users, and UTC timestamp.
[Figures: "Sanders for gaming, December 2014" and "Sanders for science, December 2014"]
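As a rough illustration of the second query type (how often members of some subreddits comment elsewhere), the filters listed above could be expressed as an Elasticsearch request body along these lines. This is only a sketch: the field names `user`, `score`, `utc`, and `subreddit`, and the function name, are assumptions rather than the deck's actual mapping, and the `bool`/`filter` form shown is the one used by Elasticsearch 2.0 and later, which may not match the version used here.

```python
def build_overlap_query(users, min_score=None, utc_from=None, utc_to=None, top_n=50):
    """Build a request body that counts comments per subreddit for a set of users.

    All field names ("user", "score", "utc", "subreddit") are hypothetical.
    """
    # Restrict to comments written by the given users.
    filters = [{"terms": {"user": list(users)}}]

    # Optional filter on comment score.
    if min_score is not None:
        filters.append({"range": {"score": {"gte": min_score}}})

    # Optional half-open time window [utc_from, utc_to) in epoch seconds.
    utc_range = {}
    if utc_from is not None:
        utc_range["gte"] = utc_from
    if utc_to is not None:
        utc_range["lt"] = utc_to
    if utc_range:
        filters.append({"range": {"utc": utc_range}})

    return {
        "size": 0,  # only the aggregation is needed, not the matching hits
        "query": {"bool": {"filter": filters}},
        "aggs": {
            "by_subreddit": {"terms": {"field": "subreddit", "size": top_n}}
        },
    }


# Example: comments by two users during December 2014 with score >= 5.
query = build_overlap_query(
    ["alice", "bob"], min_score=5, utc_from=1417392000, utc_to=1420070400
)
```

The resulting dictionary can be serialized to JSON and sent as a search request; swapping the aggregation field for a word field would give the wordcloud variant.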
Pipeline
Elastic: 6 Nodes
Cassandra and Kafka: 4 Nodes
Flask: 1 Node
Database Design
Only store important information and remove stopwords. This cuts the raw JSON data by ~70%. Also, an aggregation pulls the entire data field into RAM, so the table had to be broken up by month.
Elastic aggregations are estimates; store the true data in Cassandra if perfect accuracy or full text is needed.
Schema: ((user, year_month), utc, subreddit, post_id, {word: frequency}, full_text)
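To make the schema concrete, here is a minimal sketch of turning one raw comment into a row of that shape. The input keys (`author`, `created_utc`, `subreddit`, `id`, `body`), the tiny stopword list, the tokenizer, and the `YYYY_MM` format for `year_month` are all assumptions for illustration; the deck does not say how these were actually handled.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = frozenset({"a", "an", "and", "the", "is", "of", "to", "in", "it", "this"})


def to_row(comment):
    """Map a raw comment dict (hypothetical keys) onto the schema fields."""
    utc = int(comment["created_utc"])
    # (user, year_month) is the partition key, so derive the month from the timestamp.
    year_month = datetime.fromtimestamp(utc, tz=timezone.utc).strftime("%Y_%m")

    # Lowercase, split into simple word tokens, drop stopwords, count the rest.
    words = re.findall(r"[a-z']+", comment["body"].lower())
    freq = Counter(w for w in words if w not in STOPWORDS)

    return {
        "user": comment["author"],
        "year_month": year_month,
        "utc": utc,
        "subreddit": comment["subreddit"],
        "post_id": comment["id"],
        "word_freq": dict(freq),
        "full_text": comment["body"],
    }


row = to_row({
    "author": "alice",
    "created_utc": "1417392000",
    "subreddit": "science",
    "id": "c1",
    "body": "The science of the science is fun",
})
```

Keeping only these fields and the word counts is where the size reduction would come from; `full_text` is the part that would live in Cassandra rather than in the search index.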
Challenges Encountered
Step 1
Almost all of the challenges dealt with Elasticsearch (ES).
Very powerful, but indexing can be resource-intensive. Had to raise ES_CACHE to make aggregations work, due to a kink.
Spark did not play well with ES. It kept crashing when trying to upload directly, so processing and uploading were separated into two parts.
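One way to picture the two-part split is to have the processing step write its output to a file in the Elasticsearch bulk format, and upload that file in a separate step. The sketch below covers only the first half, and it is an assumption about how such a split could look, not the deck's actual code. The per-month index name `comments_<year_month>` and both function names are hypothetical, and older Elasticsearch versions also expect a document type, which is omitted here.

```python
import json


def index_name(year_month):
    # Hypothetical naming scheme: one index per month.
    return "comments_" + year_month


def to_bulk_lines(rows):
    """Yield bulk-format lines: an action line followed by the document itself."""
    for row in rows:
        yield json.dumps({"index": {"_index": index_name(row["year_month"])}})
        yield json.dumps(row)


def write_bulk_file(rows, path):
    """Write processed rows to a newline-delimited file for a later upload step."""
    with open(path, "w", encoding="utf-8") as fh:
        for line in to_bulk_lines(rows):
            fh.write(line + "\n")


lines = list(to_bulk_lines([{"year_month": "2014_12", "user": "alice"}]))
```

With the output on disk, a crash in the upload step no longer forces the processing step to be rerun.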
Performance
Throughput: Using 4 m.xlarge instances, can process and upload
~12 GB of data in ~30 min. Reddit only has ~1 GB of data a day.
ES: For the earlier months, very fast (<500 ms). For later
months, when the data is ~30 GB raw and 8 GB in the DB, searches
can take >30 seconds. Subsequent searches are faster.
Also, found ES is xx% accurate.
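The throughput figures above imply a lot of headroom over Reddit's daily volume. The short calculation below makes that explicit, using only the approximate numbers quoted (12 GB in 30 minutes, 1 GB per day) and assuming decimal units (1 GB = 1000 MB); the function name is illustrative.

```python
def throughput_summary(data_gb=12.0, minutes=30.0, daily_gb=1.0):
    """Derive rates from the approximate figures quoted in the slides."""
    gb_per_min = data_gb / minutes
    return {
        # Sustained rate in MB/s, assuming 1 GB = 1000 MB.
        "mb_per_sec": data_gb * 1000 / (minutes * 60),
        # Minutes needed to process one day's worth of new data.
        "minutes_per_day_of_data": daily_gb / gb_per_min,
        # How many times faster than the incoming daily volume.
        "headroom_factor": gb_per_min * 60 * 24 / daily_gb,
    }


summary = throughput_summary()
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

With these inputs the pipeline sustains roughly 6.7 MB/s, needs about 2.5 minutes per day of new data, and runs several hundred times faster than the incoming volume.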