
Apache Spark Clusters for Everyone | AWS Public Sector Summit 2016


© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Khuloud Odeh, Alex C. Engler

June 20, 2016

Apache Spark Clusters for Everyone
Easy Access to Amazon EMR Spark Clusters Using R and Python in the Browser

Khuloud Odeh, VP of IT and CIO | [email protected] | @kodeh

Our mission is to open minds, shape decisions, and offer solutions through economic and social policy research.

Data-driven evaluation informing public policy decision-making
Telling stories through interactive data visualizations
Simulating policies before they happen using microsimulation models

Extend and strengthen research

Raise visibility & influence

Grow revenue & flexible resources

Improve productivity & work experience

High performance computing

Machine learning

Big data

Web and social

Data visualizations

Outreach

Automated process

Business intelligence

CRM

User-centric systems

Post-modern ERP

Mobility

Transformation, not simply modernization

Smarter. Faster. Better.

Our IT challenge

Publication management
Data analytics & visualization

Statistical programming
Microsimulation and modeling

Cloud-first
Scalable & elastic architecture

Our technology strategy

Urban website today

CMS, publications management and DataViz in the cloud

Before → Today

Modeling in the cloud architecture concept

[Architecture diagram: auto-scaled web servers with R CPUs, queues, a networked data repository (shared folder / NoSQL), Hadoop compute clusters, and an RDBMS analysis server, with the slow and fast data paths marked.]

Me:

• Alex, hi.

• Data Scientist @ Urban Institute

• Professor of Data (Viz & Science) + Public Policy @ Georgetown University and Johns Hopkins University

Our problem

1. Big data

2. Small budget

3. Advanced statistics

4. Programming limitations

Our solution

Apache Spark +

R/Python IDEs in the browser +

Amazon Elastic MapReduce

Our solution

Apache Spark

• Distributed, in-memory computing framework made for big data

• Written in Scala/Java, but has R and Python APIs

• Good and improving statistical methods

• Open source

Our solution

R/Python IDEs in the browser

• SparkR + PySpark: researchers can work in relatively familiar languages

• RStudio / Jupyter notebooks offer an interactive development environment

Our solution

Amazon Elastic MapReduce (Amazon EMR)

• Elastic – only pay for clusters that are turned on

• Fast – 12-15 minute cluster spin up

• Free unlimited data transfer from S3

Our solution

New problem – How do researchers use this?

On a Windows computer, this requires:
• Create an AWS account
• Create a PPK file
• Install the AWS CLI
• Use the Linux command line for the EMR bootstrap
• Install/use PuTTY for SSH
• Use FoxyProxy for port forwarding

Linux command line

Our solution

Automate many processes w/ Python script
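The deck's actual automation was a Python script (not reproduced here). As a hedged illustration of the step it automates, the sketch below shells out to the AWS CLI from R, to stay in the R-centric register of the rest of the talk; the cluster name, key pair, release label, and instance sizing are placeholder assumptions, not the presenters' values.

```r
# Illustrative sketch only -- the presenters' automation was a Python script.
# Requires a locally installed and configured AWS CLI; all names below are
# placeholder assumptions.
launch_spark_cluster <- function(name     = "sparkr-demo",
                                 key_name = "my-key-pair",   # assumed key pair
                                 release  = "emr-4.7.1",     # assumed 2016-era release label
                                 type     = "r3.8xlarge",
                                 count    = 4) {
  system2("aws", args = c(
    "emr", "create-cluster",
    "--name", name,
    "--release-label", release,
    "--applications", "Name=Spark",
    "--instance-type", type,
    "--instance-count", as.character(count),
    "--use-default-roles",
    "--ec2-attributes", paste0("KeyName=", key_name)
  ), stdout = TRUE)
}

launch_spark_cluster()   # prints the new ClusterId (JSON by default)
```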

Alternative: AWS CloudFormation

• Specify and launch AWS resources with a JSON template

• Obviates the need for local installations on researcher machines

• Still meets security requirements

AWS CloudFormation – specify parameters

AWS CloudFormation – specify resources

Applying SparkR

HFPC mortgage default data

• 800 million rows, growing to an expected 2–3 billion

• 1 TB of data

• Requires data manipulation and logistic regression (must be deterministic)

EMR cluster

• Four r3.8xlarge EC2 instances:
  • 244 GB memory each (~1 TB total)
  • 32 vCPUs each (128 total)
  • 1,296 GB SSD storage

• $2.66/hour each, or ~$10/hour total

RStudio in the browser

Setting up a Spark context
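The original slide here was an RStudio screenshot. As a minimal sketch (not the presenters' exact code), attaching an in-browser RStudio session to the cluster's Spark installation might look like the following; it uses the Spark 2.x sparkR.session() API, whereas the 2016 deck would have used the older sparkR.init()/sparkRSQL.init() calls, and the app name and executor memory are assumptions.

```r
# Minimal sketch, assuming Spark 2.x SparkR; not the presenters' exact code.
# Load the SparkR package bundled with the cluster's Spark installation,
# then start a session against YARN.
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

sparkR.session(
  master      = "yarn",
  appName     = "mortgage-default-demo",               # assumed app name
  sparkConfig = list(spark.executor.memory = "32g")    # illustrative sizing
)
```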

Reading in data from S3
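Again the original slide was a screenshot. This sketch assumes a placeholder S3 path and the built-in csv source (Spark 2.x; on Spark 1.6 the spark-csv package played this role). Because EMR reads s3:// URIs natively, nothing has to be copied onto the cluster first.

```r
# Placeholder bucket, path, and schema -- not the real HFPC file layout.
loans <- read.df(
  "s3://my-bucket/mortgage/performance/",   # assumed S3 location
  source      = "csv",
  header      = "true",
  inferSchema = "true"
)

printSchema(loans)
nrow(loans)   # row count executes as a distributed job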

Dplyr–like Syntax
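A sketch of the dplyr-like verbs; the column names (loan_age, state, default_flag) are invented placeholders rather than actual HFPC variables.

```r
# SparkR verbs mirror dplyr but evaluate lazily on the cluster.
# Column names below are placeholders.
active <- filter(loans, loans$loan_age > 12)

by_state <- summarize(
  groupBy(active, active$state),
  n_loans      = n(active$default_flag),
  default_rate = mean(active$default_flag)
)

head(arrange(by_state, desc(by_state$default_rate)))
```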

Data Visualization with ggplot2
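ggplot2 operates on ordinary R data frames, so the pattern is to collect() a small aggregated result back to the driver and plot that locally; this sketch continues the placeholder columns from the previous snippet.

```r
library(ggplot2)

# Bring the (small) aggregate back to the driver as a local data frame.
local_by_state <- collect(by_state)

ggplot(local_by_state, aes(x = reorder(state, default_rate), y = default_rate)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "State", y = "Default rate")
```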

Logistic Regression
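A hedged sketch of the distributed logistic regression: the formula and columns are placeholders, and spark.glm() is the Spark 2.x call (on the Spark 1.6 clusters of the talk the equivalent was glm(formula, data = df, family = "binomial")). The GLM is fit by iteratively reweighted least squares rather than a stochastic method, which lines up with the deterministic-results requirement.

```r
# Placeholder formula and columns -- not the actual model specification.
model <- spark.glm(
  loans,
  default_flag ~ loan_age + credit_score + ltv,
  family = "binomial"
)

summary(model)                    # coefficients, standard errors, deviance
preds <- predict(model, loans)    # adds a 'prediction' column
head(select(preds, "default_flag", "prediction"))
```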

Speed Comparisons

SAS on server
• Urban Institute local SAS server
• Single-threaded

HP SAS on EC2 cluster
• High performance SAS
• On a Cloudera Hadoop cluster optimized by SAS professionals

SparkR on EMR
• Spark + R
• On EMR
• Optimized (maybe) by me

Speed Comparisons

All times in minutes.

                                            Data import   Sort   Merge   Simple aggregations   Logistic regression
SAS on a server                                      58     44      26                   125                   306
High performance SAS on a Hadoop cluster             29     13       3                     1                     9
SparkR on EMR                                         6      6       2                     3                     8

Conclusion

Spark + R/Python + Amazon EMR w/ CloudFormation

Solves a very tough problem:
• Scalable to huge datasets
• Low cost
• Very fast
• Completely elastic
• Accessible
• Robust statistics

Thank you!

GitHub page: https://github.com/UrbanInstitute/spark-social-science

E-mail: [email protected]

Twitter: @alexcengler