Dancing in R

Come find out the good parts and the not-so-good parts that R brings to the table when it comes to number crunching.

Page 1: Dancing in R

Dancing with R

Page 2: Dancing in R

Expectations

✓ A possible data model using the R language when you have to query data over time ranges.

✓ The super cool libraries on your must list for number crunching.

✓ Language/libraries/ecosystem gotchas.

✓ Scaling and performance tuning.

✓ Tools, IDEs and deployment

✗ A comparison of different languages / frameworks for number crunching.

Page 3: Dancing in R

Problem Space

Weekly metrics (in and out and planning data)

Each week: ~1M rows (out), 50K rows (in), 10K rows (planning)

Query over 1 to 52 weeks (102 weeks cover in the worst case)

Entity Hierarchy (rollup metrics and entity level trends over weeks)

Variation- and normalization-based analysis.

Page 4: Dancing in R

Problem Space

…Continued

Metrics Grouping

First Degree metrics (rollup at a level higher) – 40%

Second Degree metrics (rollup over a filtered aggregated set) – 40%

Third Degree metrics (rollups which include comparison and further processing) – 20%

Apply multiple hierarchies

2 * ∑(r=1..52) ((52 - r) + 1)

Forget precomputation!!
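The size of the expression above is easy to check directly. The sketch below simply evaluates it in R. Reading the inner term as "how many contiguous ranges of length r fit into 52 weeks" is an interpretation rather than something the slide states, and the variable names are illustrative.

```r
# Evaluate 2 * sum over r = 1..52 of ((52 - r) + 1), as written on the slide.
weeks <- 52
range_lengths <- seq_len(weeks)                    # r = 1..52
ranges_per_length <- (weeks - range_lengths) + 1   # 52, 51, ..., 1
total_ranges <- 2 * sum(ranges_per_length)
print(total_ranges)   # 2756
```

Multiply that by every metric and every hierarchy and the case against precomputing each combination becomes clear.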

Page 5: Dancing in R

Design, Partitioning

Data partitioned into facts and dimensions. Facts are weekly.

Lightning fast; mostly filter and lose the bigger data dimensions.

Low on complexity but high on input space.

Primarily reduces the input space as much as possible.

Dimensions are the real number churners. They collect and aggregate across the queried year-weeks.

Enriches with meta and other hierarchical data where needed.

Does the more complicated number crunching.

High on complexity but relatively lower on input space when compared to the facts.
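The split described above can be sketched in a few lines of base R. This is only an illustration under assumptions: the column names (`week`, `entity`, `region`, `value`), the helper names `fact_step` and `dimension_step`, and the sample data are all hypothetical and not taken from the deck.

```r
# Hypothetical weekly fact tables: one data.frame per year-week.
make_week <- function(week) {
  data.frame(
    week   = week,
    entity = rep(c("A", "B", "C"), each = 4),
    region = rep(c("north", "south"), times = 6),
    value  = seq_len(12) * week,
    stringsAsFactors = FALSE
  )
}
weekly_facts <- lapply(1:3, make_week)

# Fact step: a cheap per-week filter that also drops columns the query does not need.
fact_step <- function(week_df, wanted_region) {
  week_df[week_df$region == wanted_region, c("week", "entity", "value")]
}

# Dimension step: collect the reduced weeks and aggregate per entity.
dimension_step <- function(reduced_weeks) {
  combined <- do.call(rbind, reduced_weeks)
  aggregate(value ~ entity, data = combined, FUN = sum)
}

reduced <- lapply(weekly_facts, fact_step, wanted_region = "north")
rollup <- dimension_step(reduced)
print(rollup)
```

The fact step touches many rows but does almost nothing to them. The dimension step sees far fewer rows and is where the heavier logic would live.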

Page 6: Dancing in R

Fork and Join

Multiple Dimension instances

One fact instance vs Facts within dimensions

[Diagram: fact instances F(1), F(n), F(m) and dimension instance D1]
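A fork-and-join over weekly facts might look roughly like the sketch below. It uses `mclapply` from the `parallel` package bundled with R, which took over that function from `multicore`. The worker function `fact_worker` and its fake data are hypothetical stand-ins, not the deck's code.

```r
library(parallel)  # bundled with R; provides mclapply

# Hypothetical fact worker: reduce one week to a single partial sum.
fact_worker <- function(week) {
  values <- seq_len(1000) * week   # stand-in for that week's rows
  sum(values)
}

# Forking is unavailable on Windows, so fall back to one core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Fork: one fact task per week. Join: mclapply returns results in input order.
weeks <- 1:8
partials <- mclapply(weeks, fact_worker, mc.cores = cores)

# Dimension step: combine the joined partial results.
total <- sum(unlist(partials))
print(total)
```

Because the workers only read their input and return a value, nothing needs to be shared or locked between the forked processes.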

Page 7: Dancing in R

The #cool libraries

Page 8: Dancing in R

Must have!!

data.table

The R God of number crunching

Multicore

Just no words!

Just be careful with upgrades!

RJSONIO

JSON serialization and deserialization. Super fast!

R.cache – Memoization!!

Your real life saver!

Functional/plyr – for all your dynamic functional needs – personal preference.

testthat

Bless your code before it takes birth
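To give a flavour of the first library in the list above, here is a small rollup in data.table. The table, its column names and the numbers are made up for illustration. Only the `DT[i, j, by]` form is the library's own.

```r
library(data.table)

# Hypothetical weekly metrics; column names are illustrative only.
metrics <- data.table(
  week   = rep(1:2, each = 4),
  parent = rep(c("P1", "P1", "P2", "P2"), times = 2),
  entity = rep(c("e1", "e2", "e3", "e4"), times = 2),
  value  = c(10, 20, 30, 40, 1, 2, 3, 4)
)

# Filter the queried weeks (i), sum the values (j), grouped by parent (by).
rollup <- metrics[week %in% 1:2, .(total = sum(value)), by = parent]
print(rollup)
```

Filter, aggregate and group all happen in one expression, which is a large part of why the package suits this kind of rollup work.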

Page 9: Dancing in R

What you are stuck with

Rook

HTTP web server, which was intended to be a help doc hosting server.

Hopefully not inspired by the “Rook” in chess!

Still in the 90’s. Make sure your URLs are compliant.

RCurl

HTTP client

More fun to come on this!

Page 10: Dancing in R

As a Service

Single threaded.

Minimum HTTP support.

You have to live with rook.

Multicore support.

More domain specific.

Data is read only and R has all the right tools for you to exploit this.
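One way to exploit read-only data is memoization: if the inputs never change, a result computed once stays valid. The sketch below is a hand-rolled in-memory cache built from a closure, not the API of the caching package the deck recommends. `memoize`, `weekly_total` and the call counter are hypothetical names used only for illustration.

```r
# Wrap a function so repeated calls with the same arguments reuse the stored result.
memoize <- function(fn) {
  cache <- new.env(parent = emptyenv())
  function(...) {
    key <- paste(deparse(list(...)), collapse = "")
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, fn(...), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

# Hypothetical expensive query over a week range; counts real invocations.
calls <- 0
weekly_total <- function(from_week, to_week) {
  calls <<- calls + 1
  sum(seq(from_week, to_week))
}

cached_total <- memoize(weekly_total)
first  <- cached_total(1, 52)
second <- cached_total(1, 52)   # served from the cache
print(c(first, second, calls))
```

The second call never reaches `weekly_total`. That shortcut is only safe because the underlying data cannot change between calls.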

Page 11: Dancing in R

Yea Yea the dance!

The Other Face

Page 12: Dancing in R

Know your code!

Smallest things get amplified!

Don’t leave things to chance!

Upgrades yikes!!

Fix your library versions!

R upgrade, make sure you pray! And take your unit tests seriously. They normally speak in riddles ;)

Clean your own backyard! The GC of R is garbage at times.
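Cleaning your own backyard can be as simple as dropping big intermediates as soon as they are no longer needed and asking for a collection explicitly. A minimal sketch, with a hypothetical function name and synthetic data:

```r
# Hypothetical crunching step that needs a large temporary object.
column_means <- function(n_rows) {
  big <- matrix(seq_len(n_rows * 4), ncol = 4)  # large intermediate
  result <- colMeans(big)
  rm(big)          # drop the only reference to the intermediate
  invisible(gc())  # ask R to collect now rather than at some later point
  result
}

means <- column_means(1e5)
print(means)
```

Whether the explicit `gc()` call pays off depends on the workload, so it is worth measuring before sprinkling it everywhere.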

Page 13: Dancing in R

Some homegrown Tools

Sublime Awesomeness!!

Osiris

Donatello

GoToTest

Rmake

Rmocks

Sane

Look in github for: jpsimonroy or ashokgowtham

Page 14: Dancing in R

Testing

Functional testing – cucumber – Sounds very interesting.

Test data setup!

Customize startup

Memory Management to run on dev machines.

Make your tests run faster.

Performance testing!
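For the unit-level side of the points above, a small hand-built fixture keeps tests fast and light enough for a dev machine. The sketch below uses testthat. The function under test (`rollup_by_parent`) and the fixture are hypothetical examples, not code from the deck.

```r
library(testthat)

# Hypothetical function under test: roll values up per parent entity.
rollup_by_parent <- function(df) {
  aggregate(value ~ parent, data = df, FUN = sum)
}

# A tiny hand-built fixture keeps the test fast and memory-light.
fixture <- data.frame(
  parent = c("P1", "P1", "P2"),
  value  = c(1, 2, 5),
  stringsAsFactors = FALSE
)

test_that("rollup sums values per parent", {
  result <- rollup_by_parent(fixture)
  expect_equal(result$value[result$parent == "P1"], 3)
  expect_equal(result$value[result$parent == "P2"], 5)
})
```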

Page 15: Dancing in R

Our World!

Data Sync / Deploy! Build and tools! Test!

Page 16: Dancing in R

?