Introduction to SparkR Shivaram Venkataraman, Hossein Falaki


Page 1: Use r tutorial part1, introduction to sparkr

Introduction to SparkR

Shivaram Venkataraman, Hossein Falaki

Page 2: Use r tutorial part1, introduction to sparkr

Big Data & R

[Diagram: DataFrames, Visualization, Libraries + Data]

Page 3: Use r tutorial part1, introduction to sparkr

Big Data & R: Challenges

• Data access: HDFS, Hive
• Capacity: single machine memory
• Parallelism: single thread

Page 4: Use r tutorial part1, introduction to sparkr

Apache Spark

Engine for large-scale data processing

• Fast, easy to use
• Runs everywhere: EC2, clusters, laptop, etc.

Page 5: Use r tutorial part1, introduction to sparkr

Speed

Scalable

Flexible

Statistics

Visualization

DataFrames

SparkR

Page 6: Use r tutorial part1, introduction to sparkr

Big Data & R: Patterns

1. Big Data, Small Learning
2. Partition Aggregate
3. Large Scale Machine Learning

Page 7: Use r tutorial part1, introduction to sparkr

1. Big Data, Small Learning

[Diagram: Data → Cleaning, Filtering, Aggregation → Collect → Subset → DataFrames, Visualization, Libraries]

Page 8: Use r tutorial part1, introduction to sparkr

1. Big Data, Small Learning

songs <- read.df("songs.json", "json")

newSongs <- filter(songs, songs$year > 2000)

ggplot(collect(newSongs))

[Diagram: Data → Cleaning, Filtering, Aggregation → Collect → Subset]

Page 9: Use r tutorial part1, introduction to sparkr

2. Partition Aggregate

[Diagram: Data, Params → Best Model]

Parameter Tuning

Page 10: Use r tutorial part1, introduction to sparkr

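2. Partition Aggregate

library(MASS)  # provides lm.ridge

params <- c(1e-3, 1e-1, 1e2)
data <- read.csv("t.csv")

# lambda must be passed by name; the third positional argument of lm.ridge is subset
train <- function(prm) { lm.ridge(y ~ x + z, data, lambda = prm) }

lapply(params, train)

[Diagram: Data, Params → Best Model]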
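The slide's snippet fits one model per parameter but does not show the "Best Model" step. A minimal sketch of the full partition-aggregate loop on a single machine follows. The synthetic data frame is a hypothetical stand-in for t.csv, and choosing the model by the GCV score that lm.ridge reports is an assumption, not necessarily the criterion the presenters used.

```r
library(MASS)  # lm.ridge ships with the MASS package

set.seed(42)
# Hypothetical stand-in for t.csv: y depends linearly on x and z
n <- 200
data <- data.frame(x = rnorm(n), z = rnorm(n))
data$y <- 2 * data$x - data$z + rnorm(n, sd = 0.5)

params <- c(1e-3, 1e-1, 1e2)

# Partition: fit one ridge regression per candidate lambda
train <- function(prm) lm.ridge(y ~ x + z, data, lambda = prm)
models <- lapply(params, train)

# Aggregate: compare the fits by generalized cross-validation score
gcv <- sapply(models, function(m) unname(m$GCV))
bestLambda <- params[which.min(gcv)]
bestModel <- models[[which.min(gcv)]]

print(bestLambda)
```

Each call to train is independent of the others, which is what makes this pattern easy to distribute.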

Page 11: Use r tutorial part1, introduction to sparkr

3. Large Scale Machine Learning

Data Featurize Learning Model

Page 12: Use r tutorial part1, introduction to sparkr

3. Large Scale Machine Learning

Data Featurize Learning Model

training <- read.csv("t.csv")

model <- glm(delay ~ Distance + Dest, family = "gaussian", data = training)

summary(model)

Page 13: Use r tutorial part1, introduction to sparkr

Big Data & R

• Big Data, Small Learning
• Partition Aggregate
• Large Scale Machine Learning

SparkR: Unified approach

Page 14: Use r tutorial part1, introduction to sparkr

SparkR DataFrames

people <- read.df("people.json", "json")

avgAge <- select(people, avg(people$age))

head(avgAge)

Number of data sources

Column Functions, SQL

Support for R UDFs
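As a rough illustration of the "Column Functions, SQL" point, the sketch below computes the same average twice, once with column functions and once with a SQL query. It assumes the Spark 2.x SparkR API (sparkR.session, createOrReplaceTempView) and a working local Spark installation. The small local data frame is a hypothetical stand-in for people.json.

```r
library(SparkR)
sparkR.session()  # assumes Spark 2.x with SparkR available

# Hypothetical stand-in for people.json
localPeople <- data.frame(name = c("Ann", "Bob", "Cho"),
                          age = c(30, 19, 12),
                          stringsAsFactors = FALSE)
people <- createDataFrame(localPeople)

# Column functions: aggregate with a column expression
avgAgeCol <- agg(people, avg_age = avg(people$age))

# SQL: register the DataFrame as a temporary view, then query it
createOrReplaceTempView(people, "people")
avgAgeSql <- sql("SELECT AVG(age) AS avg_age FROM people")

# Both results are small, so collecting them to local R is cheap
colResult <- collect(avgAgeCol)$avg_age
sqlResult <- collect(avgAgeSql)$avg_age
```

Both forms produce a SparkDataFrame, so either can feed further SparkR operations before anything is collected.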

Page 15: Use r tutorial part1, introduction to sparkr

Large Scale Machine Learning

Integration with MLlib

Key Features
• R-like formulas
• Model statistics

model <- glm(a ~ b + c, data = df)

summary(model)

Page 16: Use r tutorial part1, introduction to sparkr

Partition Aggregate

spark.lapply: Simple, parallel API

Ex: Parameter tuning, Model Averaging

Include existing R packages
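A sketch of how parameter tuning might look with spark.lapply, assuming Spark 2.x and that the MASS package is installed on every worker. The train function and its synthetic data are hypothetical. Each task builds its own data so that the closure shipped to the workers stays self-contained.

```r
library(SparkR)
sparkR.session()  # assumes Spark 2.x with SparkR available

params <- c(1e-3, 1e-1, 1e2)

# Hypothetical training function; it runs on the workers, so it loads
# the existing R package it needs (MASS) inside the function body
train <- function(prm) {
  library(MASS)
  set.seed(1)
  d <- data.frame(x = rnorm(100), z = rnorm(100))
  d$y <- d$x + d$z + rnorm(100)
  fit <- lm.ridge(y ~ x + z, d, lambda = prm)
  c(lambda = prm, gcv = unname(fit$GCV))
}

# Same shape as lapply(params, train), but each element is processed
# as a separate Spark task; the results come back as a local R list
results <- spark.lapply(params, train)

# Aggregate on the driver: keep the lambda with the lowest GCV score
scores <- sapply(results, function(r) r[["gcv"]])
bestLambda <- params[which.min(scores)]
```

The results are returned to the driver as an ordinary list, so the aggregation step here is plain R.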

Page 17: Use r tutorial part1, introduction to sparkr

SparkR Status

Open source, part of Apache Spark

> 60 committers from UC Berkeley, Databricks, IBM, Intel, Alteryx, etc.

Contributions welcome!

Page 18: Use r tutorial part1, introduction to sparkr

Tutorial Outline

Part 1: Data Exploration
• ETL: Data loading, schema
• Exploration: Filter, clean, aggregate, etc.
• Visualization: Integration with ggplot

Part 2: Advanced Analytics (After the break)

Page 19: Use r tutorial part1, introduction to sparkr

Tutorial Setup

• Each user gets a dedicated micro cluster
• Cluster is terminated after 1 hour of inactivity
• Multiple users can collaborate on a notebook
• Notebooks can be exported/imported
• Examples and tutorials in R/Python/Scala

Free online service for learning Apache Spark

Page 20: Use r tutorial part1, introduction to sparkr

Tutorial Setup

Databricks Notebooks
• Interactive workspace
• Markdown + R, Python, Scala, SQL

Sign up at http://databricks.com/ce

Page 21: Use r tutorial part1, introduction to sparkr

Tutorial Setup

Fill out our survey at

tiny.cc/sparkr-user-survey

Page 22: Use r tutorial part1, introduction to sparkr

SparkR

Big data processing from R

DataFrames for ETL, data exploration

Support for advanced analytics

Page 23: Use r tutorial part1, introduction to sparkr

Tutorial Next Steps

Sign up at http://databricks.com/ce

Part 1: tiny.cc/sparkr-tutorial-part1

Fill out our survey at tiny.cc/sparkr-user-survey