simon-roy
Come find out the good parts and the not-so-good parts that R brings to the table when it comes to number crunching.
Dancing with R
Expectations
✓ A possible data model using the R language when you have to query data over time ranges.
✓ The super cool libraries on your must list for number crunching.
✓ Language/libraries/ecosystem gotchas.
✓ Scaling and performance tuning.
✓ Tools, IDEs and deployment
✗ A comparison of different languages / frameworks for number crunching.
Problem Space
Weekly metrics (in and out and planning data)
Each week: ~1M rows (out), 50K rows (in), 10K rows (planning)
Query over 1 to 52 weeks (102 weeks covered in the worst case)
Entity hierarchy (rollup metrics and entity-level trends over weeks)
Variation-based and normalized analysis.
Problem Space
…Continued
Metrics Grouping
First Degree metrics (rollup at a level higher) – 40%
Second Degree metrics (rollup over a filtered aggregated set) – 40%
Third Degree metrics (rollups which include comparison and further processing) – 20%
Apply multiple hierarchies
2 * ∑(r=1..52) ((52-r) + 1)
Forget precomputation!!
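The sum on the slide is small enough to work out by hand. Reading (52-r) + 1 as the number of contiguous ranges of length r that fit into 52 weeks is an interpretation, not something the slide states; the arithmetic itself is exact:

```latex
2\sum_{r=1}^{52}\bigl((52-r)+1\bigr)
  = 2\sum_{r=1}^{52}(53-r)
  = 2\cdot\frac{52\cdot 53}{2}
  = 2756
```

That is 2756 combinations before the metric degrees and multiple hierarchies are multiplied in.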
Design, Partitioning
Data is partitioned into facts and dimensions.
Facts are weekly.
Lightning fast; they mostly filter and lose the bigger data dimensions.
Low on complexity but high on input space.
Primarily reduce the input space as much as possible.
Dimensions are the real number crunchers.
Collect and aggregate across the queried year-weeks.
Enrich with meta and other hierarchical data where needed.
Do the more complicated number crunching.
High on complexity but relatively lower on input space compared to the facts.
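The split between facts and dimensions can be sketched roughly as follows. This is a Python illustration of the shape only, not the R code behind the talk; the row layout, the PARENT mapping, and the function names fact_step and dimension_step are all hypothetical.

```python
from collections import defaultdict

# Hypothetical weekly fact rows; "notes" stands in for a bulky unused column.
FACT_ROWS = [
    {"week": 201301, "entity": "store-1", "units": 10, "notes": "x"},
    {"week": 201301, "entity": "store-2", "units": 4, "notes": "y"},
    {"week": 201302, "entity": "store-1", "units": 7, "notes": "z"},
    {"week": 201303, "entity": "store-1", "units": 1, "notes": "w"},
]

# Hypothetical hierarchy metadata used for enrichment.
PARENT = {"store-1": "region-a", "store-2": "region-a"}


def fact_step(rows, weeks, keep=("week", "entity", "units")):
    """Cheap pass: keep only the queried weeks and the needed columns."""
    return [{k: r[k] for k in keep} for r in rows if r["week"] in weeks]


def dimension_step(rows):
    """Heavier pass: enrich with the parent entity and roll up one level."""
    totals = defaultdict(int)
    for r in rows:
        totals[PARENT[r["entity"]]] += r["units"]
    return dict(totals)


reduced = fact_step(FACT_ROWS, weeks={201301, 201302})
rollup = dimension_step(reduced)
```

The fact step touches every row but does almost nothing to each one; the dimension step does the real work but only sees what the fact step lets through.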
Fork and Join
Multiple Dimension instances
One fact instance vs. facts within dimensions
[Diagram: F(1), F(n), F(m), D1]
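A fork-and-join over weekly fact instances might look roughly like this. It is a sketch in Python using a thread pool, which is a stand-in and not the process forking the R multicore package does; WEEKLY_VALUES, fact_instance, and dimension_instance are hypothetical names. It shows the shape of the pattern, not a speedup.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-week fact data: week -> list of metric values.
WEEKLY_VALUES = {1: [3, 5, 8], 2: [2, 2], 3: [10], 4: [1, 1, 1]}


def fact_instance(week, threshold=2):
    """Fork target: filter one week's values and return a partial sum."""
    return sum(v for v in WEEKLY_VALUES[week] if v >= threshold)


def dimension_instance(weeks):
    """Fork one fact task per week, then join the partial results."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(fact_instance, weeks))
    return sum(partials)


total = dimension_instance([1, 2, 3])
```

Each week is independent of the others, so the fork needs no coordination and the join is a plain reduction.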
The #cool libraries
Must have!!
data.table
The R God of number crunching
Multicore
Just no words!
Just be careful with upgrades!
RJSONIO
JSON serialization and deserialization. Super fast!
Rcache – Memoization!!
Your real life saver!
Functional/plyr – for all your dynamic functional needs – personal preference.
test_that
Bless your code before it takes birth
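Memoization, the idea the list above credits as a life saver, can be sketched in a few lines. This uses Python's functools.lru_cache as a stand-in, not the R caching library from the slide; rollup and the CALLS counter are hypothetical and exist only to make the cache hit visible.

```python
from functools import lru_cache

CALLS = {"count": 0}  # instrumentation for this example only


@lru_cache(maxsize=None)
def rollup(start_week, end_week):
    """Pretend-expensive aggregate over an inclusive week range."""
    CALLS["count"] += 1
    return sum(range(start_week, end_week + 1))


first = rollup(1, 52)
second = rollup(1, 52)  # served from the cache; the body does not run again
```

Caching like this is only safe when the same arguments always give the same answer, which is why it pairs well with read-only data.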
What you are stuck with
Rook
HTTP web server, which was intended to be a help doc hosting server.
Hopefully not inspired by the “Rook” in chess!
Still in the 90’s. Make sure your URLs are compliant.
RCurl
HTTP client
More fun to come on this!
As a Service
Single threaded.
Minimum HTTP support.
You have to live with Rook.
Multicore support.
More domain specific.
Data is read only and R has all the right tools for you to exploit this.
Yea Yea the dance!
The Other Face
Know your code!
Smallest things get amplified!
Don’t leave things to chance!
Upgrades yikes!!
Fix your library versions!
R upgrade, make sure you pray! And take your unit tests seriously. They normally speak in riddles ;)
Clean your own backyard! The GC of R is garbage at times.
Some homegrown Tools
Sublime Awesomeness!!
Osiris
Donatello
GoToTest
Rmake
Rmocks
Sane
Look on GitHub for: jpsimonroy or ashokgowtham
Testing
Functional testing – cucumber – Sounds very interesting.
Test data setup!
Customize startup
Memory Management to run on dev machines.
Make your tests run faster.
Performance testing!
Our World!
Data Sync / Deploy! Build and tools! Test!
?