
reddit genie


Page 4: reddit genie

Supported Queries

Wordclouds for users of any group of subreddits, built from their comments everywhere.

Analysis of how often members of subreddits make comments in other subreddits.

Results can be filtered by comment score, user activity, specific users, and UTC timestamp.

Example wordclouds: Sanders for gaming, December 2014; Sanders for science, December 2014
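The cross-subreddit analysis and its filters can be illustrated with a small in-memory sketch. This is not the deck's Elasticsearch query; the function name, the dictionary keys (`user`, `subreddit`, `score`, `utc`), and the rule that a "member" is anyone who has commented in the home subreddits are all assumptions made for illustration.

```python
from collections import Counter


def cross_subreddit_counts(comments, home_subreddits, min_score=None, utc_range=None):
    """Count comments that members of home_subreddits made in other subreddits.

    Assumption: a "member" is any user with at least one comment in a home
    subreddit. The score and utc filters apply only to the comments being counted.
    """
    home = set(home_subreddits)
    members = {c["user"] for c in comments if c["subreddit"] in home}

    def keep(c):
        if min_score is not None and c["score"] < min_score:
            return False
        if utc_range is not None and not (utc_range[0] <= c["utc"] < utc_range[1]):
            return False
        return True

    return Counter(
        c["subreddit"]
        for c in comments
        if c["user"] in members and c["subreddit"] not in home and keep(c)
    )


sample = [
    {"user": "a", "subreddit": "gaming", "score": 5, "utc": 10},
    {"user": "a", "subreddit": "science", "score": 3, "utc": 20},
    {"user": "a", "subreddit": "science", "score": 0, "utc": 30},
    {"user": "b", "subreddit": "science", "score": 9, "utc": 40},
]
counts = cross_subreddit_counts(sample, ["gaming"], min_score=1)
```

Here user `b` is ignored because they never commented in `gaming`, and the zero-score comment is dropped by the score filter.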

Page 5: reddit genie

Pipeline

Elastic: 6 Nodes

Cassandra and Kafka: 4 Nodes

Flask: 1 Node

Page 6: reddit genie

Database Design

Only store important information and remove stopwords. This cut the raw JSON data by ~70%. Also, aggregation pulls the entire data field into RAM, so the table had to be broken up by month.
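A minimal sketch of this kind of slimming step is shown here, assuming the raw comment JSON carries `author`, `body`, `created_utc`, `subreddit`, and `id` fields. The stopword list, the `slim_comment` name, and the `YYYY_MM` key format are illustrative assumptions; the deck does not say which were used.

```python
import json
import re
from collections import Counter
from datetime import datetime, timezone

# Tiny illustrative stopword list (assumption; the real list is not given).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "is", "in", "it", "that"}


def slim_comment(raw_json):
    """Keep only the fields needed downstream and replace the body with word counts."""
    raw = json.loads(raw_json)
    words = re.findall(r"[a-z']+", raw["body"].lower())
    freq = Counter(w for w in words if w not in STOPWORDS)
    utc = int(raw["created_utc"])
    created = datetime.fromtimestamp(utc, tz=timezone.utc)
    return {
        "user": raw["author"],
        # Month bucket, so each month can live in its own table or index.
        "year_month": created.strftime("%Y_%m"),
        "utc": utc,
        "subreddit": raw["subreddit"],
        "post_id": raw["id"],
        "word_freq": dict(freq),
    }


example = json.dumps({
    "author": "some_user",
    "body": "The cat and the hat",
    "created_utc": "1418000000",
    "subreddit": "gaming",
    "id": "abc123",
    "unused_field": "dropped",
})
row = slim_comment(example)
```

Dropping unused fields and stopwords is what shrinks the data, and the `year_month` value gives a natural key for splitting by month.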

Elastic aggregation is an estimate; store the true data in Cassandra if perfect accuracy or full text is needed.

Schema: ((user, year_month), utc, subreddit, post_id, {word: frequency}, full_text)
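One way this schema might be written as a Cassandra table is sketched below, held in a Python string. The table name `comments`, the column types, the `word_freq` column name, and the choice of `utc` and `post_id` as clustering columns are assumptions; the slide only shows that `(user, year_month)` is grouped together, which is read here as a composite partition key.

```python
# Hypothetical CQL for the schema above. Table name, column types, and
# clustering columns are assumptions, not taken from the slides.
CREATE_COMMENTS_TABLE = """
CREATE TABLE IF NOT EXISTS comments (
    user text,
    year_month text,
    utc timestamp,
    subreddit text,
    post_id text,
    word_freq map<text, int>,
    full_text text,
    PRIMARY KEY ((user, year_month), utc, post_id)
)
"""


def partition_key(user, year_month):
    """Return the composite partition key that groups one user's month of comments."""
    return (user, year_month)


key = partition_key("some_user", "2014_12")
```

Under these assumptions, one user's comments for a given month land in a single partition, ordered by `utc`. The statement could be passed to a Cassandra driver session's `execute` call.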

Page 7: reddit genie

Challenges Encountered

Step 1

Almost all of the challenges dealt with Elasticsearch (ES).

Very powerful, but indexing can be resource intensive. Had to increase ES_CACHE to make aggregations work due to a kink.

Spark did not play well with ES. It kept crashing when trying to upload directly, so the job was separated into two parts.
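The deck does not say what the two parts were. One plausible reading is that Spark only writes processed records out, and a separate step uploads them to ES in batches. The sketch below shows only that second step, building request bodies in the newline-delimited format the Elasticsearch bulk API expects. The helper names, the batch size, and the index name are hypothetical, and the exact action metadata depends on the Elasticsearch version.

```python
import json
from itertools import islice


def batched(docs, size):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def bulk_body(chunk, index_name):
    """Build a bulk request body: one action line, then one source line, per document."""
    lines = []
    for doc in chunk:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["post_id"]}}))
        lines.append(json.dumps(doc))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


docs = [{"post_id": str(i), "subreddit": "gaming"} for i in range(5)]
bodies = [bulk_body(chunk, "comments_2014_12") for chunk in batched(docs, 2)]
```

Uploading in bounded batches from a separate process keeps a failed upload from taking the processing job down with it.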

Page 8: reddit genie

Performance

Throughput: Using 4 m.xlarges, can process and upload ~12G of data in ~30 min. Reddit only has ~1G of data a day.
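As a rough check of the headroom these figures imply, the arithmetic can be written out. It assumes throughput scales linearly with data size, which the slides do not claim; the function name is illustrative.

```python
def minutes_to_process(data_gb, batch_gb=12.0, batch_minutes=30.0):
    """Estimate processing time from the quoted rate, assuming linear scaling."""
    rate_gb_per_min = batch_gb / batch_minutes  # ~0.4 GB per minute
    return data_gb / rate_gb_per_min


daily_minutes = minutes_to_process(1.0)  # one day of Reddit data, ~1 GB
```

At the quoted rate, a day's worth of data (~1G) would take roughly 2.5 minutes to process and upload.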

ES: For the earlier months, very fast (<500 ms). For later months, when the data is ~30 GB raw (8 GB in the DB), searches can take >30 seconds. Subsequent searches are faster.

Also, found ES is xx% accurate.

Page 9: reddit genie