
Nerddit Demo Presentation

Page 1: Nerddit Demo Presentation

More than a reddit N-gram viewer

JERRY PRAWIHARJO
INSIGHT DATA ENGINEERING FELLOW

nerddit

Page 2: Nerddit Demo Presentation

Motivation

N-grams:
◦ Allow data scientists to do topic-trend analysis and language analysis
◦ Allow for a “type-ahead” feature

Subreddits network graph:

[Diagram: subreddit nodes SR1, SR2, SR3 and four nodes labeled U]

Page 3: Nerddit Demo Presentation

1-gram

My name is Jerry

“My” “name” “is” “Jerry”

Page 4: Nerddit Demo Presentation

2-grams

My name is Jerry

“My name” “name is” “is Jerry”

Page 5: Nerddit Demo Presentation

3-grams

My name is Jerry

“My name is” “name is Jerry”
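The 1-, 2- and 3-gram slides can be reproduced with a short sliding-window sketch. The function name `ngrams` and the plain whitespace tokenizer are assumptions made for illustration; the slides do not show how the project tokenizes comments.

```python
def ngrams(text, n):
    """Return the n-grams of `text` as space-joined strings."""
    tokens = text.split()  # naive whitespace tokenizer (an assumption)
    # Slide a window of n tokens across the sentence.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "My name is Jerry"
for n in (1, 2, 3):
    print(n, ngrams(sentence, n))
```

A sentence shorter than `n` tokens simply yields an empty list.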

Page 6: Nerddit Demo Presentation

Pipeline

6x m4.xlarge: $1.43/hour

5x m4.xlarge: $28.7/day

t2.micro: free

~10GB, 10/2007-12/2015

>1TB uncompressed

4x m4.large: $11.5/day

Page 7: Nerddit Demo Presentation

Reddit Statistics

Year  Date        Comments  Unique authors  Unique subreddits
2015  2015-12-01  10000     25000           60000
2014  2014-12-01  50000     35000           40000

Page 8: Nerddit Demo Presentation

N-gram

Time series n-grams:

Ngram            Date     N  Count  Percentage
Hallows          2011-04  1  10     0.1
Deathly Hallows  2011-04  2  50     0.1

N-gram cluster against subreddits:

Ngram            N  Subreddit  Count (counter type)
Hallows          1  movies     1000
Deathly Hallows  2  movies     5000

word-parser

(“2011-04”, [“old”, “lady”, .., “Deathly”, “Hallows”,…], “movies”)

(“2011-04::old::movies”, 1)
(“2011-04::lady::movies”, 1)
…
(“2011-04::Deathly::movies”, 1)

(“2011-04::old::movies”, 10)
(“2011-04::lady::movies”, 5)
…
(“2011-04::Deathly::movies”, 2)

Job took ~2 days to complete

Regex filters: URLs, IMG links, Unicode characters
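A minimal sketch of such filters follows. All three patterns and the name `clean_comment` are illustrative guesses, not the expressions used in the project; in particular, treating the Unicode filter as "drop non-ASCII characters" is an assumption.

```python
import re

# Illustrative patterns only (assumptions, not the project's own regexes).
IMG_LINK = re.compile(r"https?://\S+\.(?:png|jpe?g|gif)\b\S*", re.IGNORECASE)
URL = re.compile(r"https?://\S+")
NON_ASCII = re.compile(r"[^\x00-\x7F]+")


def clean_comment(text):
    """Strip image links, other URLs and non-ASCII characters, then tidy spaces."""
    for pattern in (IMG_LINK, URL, NON_ASCII):
        text = pattern.sub(" ", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return " ".join(text.split())


print(clean_comment("see http://i.imgur.com/abc.jpg and https://example.com/x?y=1 ok"))
```

The generic URL pattern already covers image links; the separate pattern is kept only to mirror the three filter categories listed above.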

Page 9: Nerddit Demo Presentation

Subreddits Graph

Year  node1     node2
2011  movies    {politics: 10, games: 5, …}
2014  politics  {games: 3, conservative: 2, …}

Year  Distinct authors  Subreddit  Comments
2011  TheOceldoc        movies     100
2011  JohnDoe           politics   200

(TheOceldoc, (movies, politics, games,…))
(JohnDoe, (politics, conservative,…))

(“movies::politics”, 10)
(“movies::games”, 5)
(“politics::games”, 3)
…
(“politics::conservative”, 2)

Edge weight
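One way to read the step from author lists to weighted pairs is that an edge's weight is the number of authors active in both subreddits. That reading is an assumption, as are the function name `edge_weights` and the sample data; the sketch below also sorts each pair alphabetically so an undirected edge has a single key, which differs from the key order shown on the slide.

```python
from collections import Counter
from itertools import combinations


def edge_weights(author_subreddits):
    """Count, for every pair of subreddits, the authors active in both."""
    weights = Counter()
    for subreddits in author_subreddits.values():
        # sorted() gives each undirected pair one canonical key.
        for a, b in combinations(sorted(set(subreddits)), 2):
            weights[f"{a}::{b}"] += 1
    return weights


# Made-up sample in the (author, (subreddits...)) shape shown above.
authors = {
    "TheOceldoc": ("movies", "politics", "games"),
    "JohnDoe": ("politics", "conservative"),
}
weights = edge_weights(authors)
print(weights)
```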

Filter degree < 100
Clustering
Force Atlas 2 layout
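The degree filter can be sketched as a single pass over the edge list. The slide does not say which side of the threshold is kept, so this sketch assumes that subreddits with fewer than `min_degree` neighbours are dropped before layout; `filter_by_degree` and the sample edges are hypothetical.

```python
from collections import defaultdict


def filter_by_degree(edges, min_degree):
    """Keep edges whose two endpoints both have at least `min_degree` neighbours."""
    neighbours = defaultdict(set)
    for key in edges:
        a, b = key.split("::")
        neighbours[a].add(b)
        neighbours[b].add(a)
    # Degrees are measured once on the full graph (single pass, not iterated).
    keep = {node for node, adj in neighbours.items() if len(adj) >= min_degree}
    return {
        key: weight
        for key, weight in edges.items()
        if all(node in keep for node in key.split("::"))
    }


# Made-up edges; a threshold of 2 stands in for the 100 quoted above.
edges = {
    "movies::politics": 10,
    "movies::games": 5,
    "politics::games": 3,
    "politics::conservative": 2,
}
filtered = filter_by_degree(edges, 2)
print(filtered)
```

Here "conservative" has a single neighbour, so its only edge is removed and the other three survive.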

Page 10: Nerddit Demo Presentation

Spark Tuning

[Bar chart: time (minutes), axis from 0 to 18, for cases A, B, C and D]

Case  Rdd Compress  Kryo
A     FALSE         FALSE
B     TRUE          FALSE
C     TRUE          TRUE
D     FALSE         TRUE
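The four cases can be written out as Spark settings. This sketch assumes "Rdd Compress" refers to the standard `spark.rdd.compress` property and "Kryo" to setting `spark.serializer` to `org.apache.spark.serializer.KryoSerializer`; the helper names `conf_for` and `submit_flags` are hypothetical.

```python
# Standard Spark property names; mapping them to the table columns is an assumption.
RDD_COMPRESS = "spark.rdd.compress"
SERIALIZER = "spark.serializer"
KRYO = "org.apache.spark.serializer.KryoSerializer"

# (rdd_compress, kryo) per case, copied from the table above.
CASES = {
    "A": (False, False),
    "B": (True, False),
    "C": (True, True),
    "D": (False, True),
}


def conf_for(case):
    """Return the Spark settings for one tuning case as a dict."""
    rdd_compress, kryo = CASES[case]
    conf = {RDD_COMPRESS: str(rdd_compress).lower()}
    if kryo:
        conf[SERIALIZER] = KRYO  # otherwise Spark's default Java serializer applies
    return conf


def submit_flags(case):
    """Render the settings as `--conf key=value` flags for spark-submit."""
    return " ".join(f"--conf {k}={v}" for k, v in conf_for(case).items())


print(submit_flags("C"))
```

Keeping the cases in one table-like dict makes it easy to rerun the same job once per case and compare timings.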

Page 11: Nerddit Demo Presentation

Jerry Prawiharjo

PhD in Optoelectronics from Southampton, England
◦ Distributed computation on Beowulf cluster (MPI)

Product Development Engineer at Neophotonics
◦ Test software development and data analysis

Senior Test Development Engineer at Cisco
◦ Test station development (hardware and software) for 100G transceiver module

Page 12: Nerddit Demo Presentation

Back Up

Page 13: Nerddit Demo Presentation

Challenges

Sheer amount of data: >1TB
◦ Scoping the project: monthly time bucket (as opposed to daily or weekly)
◦ Filter foreign-language subreddits
◦ Spark tuning

S3 rate limit: process data on a file-by-file basis
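The monthly time bucket mentioned under scoping can be sketched as a timestamp-to-month mapping. It assumes each comment carries a Unix timestamp in seconds (UTC); the function name `month_bucket` is hypothetical.

```python
from datetime import datetime, timezone


def month_bucket(created_utc):
    """Map a Unix timestamp (seconds, UTC) to a "YYYY-MM" bucket."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).strftime("%Y-%m")


# 1301616000 is 2011-04-01 00:00:00 UTC.
print(month_bucket(1301616000))  # 2011-04
```

A monthly bucket produces far fewer distinct keys than a daily or weekly one, which is the scoping trade-off listed above.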