
A Parallel R Framework for Processing Large Dataset on Distributed Systems

Nov. 17, 2013

This work was initiated and supported by Huawei Technologies.

Rise of Data-Intensive Analytics: Data Sources


Personal data from the internet
• query history
• click-stream logs
• tweets
• …

Machine-generated data
• sensor networks
• genome sequencing
• physics experiments
• satellite imaging data

New Challenges


Huge demand for data analysts, statisticians & data scientists

Traditional tools work on summary or sampled data

A good tool for large-scale data:
• Usability: stick to traditional semantics
• Performance: distributed parallelism
• Fault tolerance: MapReduce-style recovery

[Figure: typical data sizes handled by analytics tools (SAS: GBs; R: MBs)]

R is used by about 30% of data analysts.

Why R:
• Data structures
• Functional language
• Rich functionality
• Graphic visualization
• Open source

Why not R:
• Single-threaded
• Limited memory

Usability


Source: survey by Revolution Analytics, http://r4stats.com/articles/popularity/

Performance: the Spark Framework
Distributed computing with fault tolerance


Developed at the AMP Lab, UC Berkeley

Flexible programming model
• DAG job scheduler

Performance
• In-memory
• Good for iterative algorithms

Resilient Distributed Datasets (RDDs)
• Recover from losses and failures

RABID Package Structure


[Figure: RABID package structure. An R user's analytics application calls RABID, an R extension package that provides matrix ops, DM functions, and lower-level ops; underneath sit an optimizer, a scheduler, and a storage server, with Spark and HDFS/HBase as the execution and storage layers.]

Runtime Overview


[Figure: runtime overview. R scripts enter through a web server into the R driver session (with symbol tables) on the master, which drives the fault-tolerant Spark scheduler and optimizer. Each slave hosts workers whose tasks run in R worker sessions, with data shuffles between slaves.]

Programming Model


R list: the most general R data structure
• A collection that can store elements of any type, including mixed types
• Similar to Python tuples
• Very general and flexible

BigList: a distributed list structure
• Extends the R list to a distributed setting
• Overrides R list functions to support BigList
• Building block for higher-level structures and functions
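As a point of reference, the base R list semantics that BigList extends (standard R, nothing RABID-specific):

x <- list(1L, "two", c(3.0, 3.5), list(nested = TRUE))  # mixed element types

lapply(x, length)   # apply a function to every element: list(1, 1, 2, 1)

x[[2]]   # "[[" extracts a single element: "two"
x[1:2]   # "[" returns a sub-list: list(1L, "two")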

RABID Example (1): Sample APIs


library("Rabid")                               # load the Rabid package

DATA <- rb.readLines("hdfs://…")               # read text into a BigList

DATA <- lapply(DATA, as.numeric, cache=TRUE)   # apply a function in parallel

centroids <- as.list(sample(DATA, 16))         # transform back to an R list

func <- function(a) { ... }                    # user-defined function (UDF)
DATA2 <- lapply(DATA, func)                    # apply the UDF in parallel

DATA3 <- aggregate(DATA2, by=id, FUN=mean)     # aggregate by user-specified keys
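For intuition, the grouping that the aggregate call performs can be mimicked in base R with tapply; the records and the key extractor id below are hypothetical stand-ins:

# Hypothetical records: each element carries a key and a numeric value
DATA2 <- list(list(key = "a", value = 1),
              list(key = "b", value = 2),
              list(key = "a", value = 3))

id   <- function(rec) rec$key                   # user-specified key extractor
keys <- sapply(DATA2, id)
vals <- sapply(DATA2, function(rec) rec$value)
tapply(vals, keys, mean)                        # group-wise means: a = 2, b = 2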

RABID Example (2): Sample APIs


library("Rabid")
mat  <- rb.read.matrix("hdfs://…", cache=F)
mat1 <- mat + mat
mat2 <- mat %*% mat
t(mat)
cor(mat)
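These distributed operators mirror base R's matrix semantics; the in-memory equivalents on an ordinary matrix are:

mat <- matrix(rnorm(16), nrow = 4)   # ordinary in-memory 4 x 4 matrix

mat1 <- mat + mat      # element-wise addition
mat2 <- mat %*% mat    # matrix multiplication
tm   <- t(mat)         # transpose
cm   <- cor(mat)       # column-wise correlation matrix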

Data Blocking for lapply()


[Figure: text input from an HDFS block forms a Hadoop split on each slave; the split is further blocked into R lists, and lapply runs block by block on Slave 1, Slave 2, …]

The data within each split is further blocked for more efficient transfer and processing.
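A minimal base-R sketch of the blocking idea; the data and block size are illustrative, not RABID's actual defaults:

lines <- as.character(1:2500)   # stand-in for the text lines of one Hadoop split

block_size <- 1000
blocks <- split(lines, ceiling(seq_along(lines) / block_size))   # 3 blocks

# Each block is transferred and processed as one unit, so per-element
# overheads are paid once per block rather than once per line
results <- lapply(blocks, as.numeric)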

Data Blocking for aggregate()


[Figure: each slave's blocks are hash-partitioned on the user's key; the data shuffle routes every record with hash key 1 to Slave 1 and every record with hash key 2 to Slave 2, and each slave then runs aggregate locally, yielding the user's keys plus their aggregated values.]
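A base-R sketch of the hash partitioning behind the shuffle, assuming string keys and a hypothetical two-slave cluster; the hash function is illustrative, not RABID's actual one:

n_slaves <- 2

# Hash a string key to a destination slave (sum of character codes mod n)
hash_key <- function(key) (sum(utf8ToInt(key)) %% n_slaves) + 1

keys <- c("user1", "user2", "user7", "user9")
dest <- sapply(keys, hash_key)

# Records sharing a key always hash to the same slave, so each slave
# can aggregate its groups locally after the shuffle
split(keys, dest)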

Distributing Computation


Computations are abstracted as R functions, which are serialized to the worker nodes and evaluated there.

R resolves a function's free variables by lexical scoping, searching the function's enclosing environments.

Functions must therefore be shipped together with the values of the free variables in their enclosing environments.

z <- 1
func1 <- function() {
  y <- 2
  func2 <- function(x) {
    x + y + z
  }
}
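Continuing the example above, evaluating the nested function shows why free-variable values must travel with the code:

f <- func1()   # returns func2, whose body mentions the free variables y and z
f(3)           # 6: y = 2 comes from func1's frame, z = 1 from the global env

# Base R's serialize() captures func1's frame (and thus y), but the global
# environment is serialized only by reference, so the value of z must be
# shipped to the remote R workers explicitly
payload <- serialize(f, NULL)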

Merging Deferred Operations


Each RABID operation has two major overheads:
1) data transfer
2) serialization/deserialization

Merging adjacent deferred operations into one reduces these overheads.

Which operations can be merged: non-aggregation operations.

Merging Deferred Operations


[Figure: in the operation DAG, x and y feed f(x) and f(y), whose results a and b feed g(a, b); after merging, x and y feed the single fused operation g[f(x), f(y)].]
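In plain R terms, the merge composes the deferred functions so the intermediate results a and b are never materialized or shipped separately (an illustrative sketch):

x <- 1; y <- 2
f <- function(v) v * 2        # a deferred element-wise operation
g <- function(a, b) a + b     # the consuming operation

# Unmerged: two steps, materializing the intermediates a and b
a <- f(x); b <- f(y)
r1 <- g(a, b)                 # 6

# Merged: one fused step, so data crosses the R/Spark boundary once
merged <- function(x, y) g(f(x), f(y))
r2 <- merged(x, y)            # 6, same result with less overhead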

Fault Tolerance


Takes advantage of Spark's fault-tolerance support on the worker side.

Detects user-code errors that would terminate R worker sessions; the error is caught and the Spark job is stopped immediately.

ZooKeeper provides fault tolerance on the master side.
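On the worker side, the error detection can be sketched as a tryCatch wrapper around the user's function; run_task and report_failure are hypothetical names, not RABID's actual internals:

# Hypothetical stub: in RABID this would notify the master to stop the job
report_failure <- function(msg) message("task failed: ", msg)

run_task <- function(user_func, block) {
  tryCatch(
    user_func(block),
    error = function(e) {
      # An uncaught user-code error would terminate the R worker session;
      # catching it here lets the framework stop the Spark job immediately
      report_failure(conditionMessage(e))
      NULL
    }
  )
}

run_task(function(b) stop("bad record"), block = 1:10)   # reports the failure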

Applications & Benchmarking


Logistic Regression:
• 10 worker nodes
• 1 to 100 million records
• RABID uses 1/6 the LOC of Hadoop

Movie Clustering (K-Means):
• 10 worker nodes
• 11 to 90 million ratings
• RABID uses 1/8 the LOC of Hadoop

Compare RABID with Hadoop & RHIPE
• RHIPE: R and Hadoop Integrated Programming Environment, developed at Purdue University
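The logistic regression benchmark follows the standard gradient-descent formulation; here is a base-R sketch of the per-iteration update on one block of records (in the distributed version, RABID would run the per-block work through lapply and combine partial gradients with its aggregation support):

sigmoid <- function(z) 1 / (1 + exp(-z))

# One gradient-descent step on a block: X is n x d, y has labels in {0, 1}
grad_step <- function(w, X, y, lr = 0.1) {
  p    <- sigmoid(X %*% w)              # predicted probabilities
  grad <- t(X) %*% (p - y) / nrow(X)    # average gradient over the block
  w - lr * grad
}

set.seed(1)
X <- matrix(rnorm(200), nrow = 100)     # 100 records, 2 features
y <- as.numeric(X[, 1] + X[, 2] > 0)
w <- matrix(0, nrow = 2)
for (i in 1:50) w <- grad_step(w, X, y)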

Logistic Regression Runtime over Data Size


[Chart: logistic regression runtime versus data size for RABID, Hadoop, and RHIPE]

K-Means Movie Clustering Runtime


[Chart: K-means movie clustering runtime for RABID, Hadoop, and RHIPE]

Logistic Regression Runtime over # nodes


Logistic Regression Runtime over # Iterations


Conclusions


RABID provides R users with a familiar programming model that scales to large cloud-based clusters, allowing larger problem sizes to be solved efficiently.

Preliminary results show that RABID outperforms Hadoop and RHIPE on our benchmarks.

RABID is cloud-ready and can be offered as a service.

Future Work


Optimizing data transfer between the R session and Spark

Trade-off between fault tolerance and performance

Benchmarking more applications at larger scales

Thank You!

haolin@purdue.edu

shuo.yang@huawei.com

smidkiff@purdue.edu

Nov. 17, 2013

Thanks also to Prof. Michael Franklin and Matei Zaharia of UC Berkeley for discussions of the ideas behind the Spark project.
